使用 Newton-Raphson 方法在 x87 FPU 上立方根

Question

我正在尝试使用 8086 处理器编写一个汇编程序来计算一个数的立方根。显然我使用的是浮点数。

root := 1.0; 
repeat
     oldRoot := root;
     root := (2.0*root + x/(root*root)) / 3.0 
until ( |root – oldRoot| < 0.001;

如何将 (2*root + x) 除以 (root*root)？

.586
.MODEL FLAT
.STACK 4096

.DATA
root    REAL4   1.0
oldRoot REAL4   2.0
Two     REAL4   2.0
inttwo  DWORD   2
itThree DWORD   3
three   REAL4   3.0
x       DOWRD   27


.CODE
main    PROC
        finit           ; initialize FPU
        fld     root    ; root in ST
        fmul    two     ; root*two
        fadd    x       ; root*two+27

        fld     root    ; root in ST
        fimul    root    ; root*root

        mov     eax, 0  ; exit
        ret
main    ENDP 
END

我想我不明白堆栈中的什么位置。行

的产品是否

fimul root ; root*root

进入ST(1)？编辑不，它进入 st(0) st(0) 中的内容被推入堆栈到 st(1)

但是我还没有找到问题的答案... 我该如何除法？现在我明白了我需要除法 st(1 ) by st(0) 但我不知道怎么做。 我试过了。

finit           ; initialize FPU
fld     root    ; root in ST
fmul    two     ; root*two
fadd    xx      ; root*two+27
; the answer to root*two+x is stored in ST(0) when we load root st(0) moves to ST1 and we will use ST0 for the next operation

fld     root    ; root in ST previous content is now in ST1
fimul   root    ; root*root
fidiv   st(1)

编辑：我把公式写错了。这就是我要找的。

(2.0*root) + x / (root*root)) / 3.0 That's what I need. 
STEP 1) (2 * root) 
STEP 2) x / (root * root) 
STEP 3) ADD step one and step 2 
STEP 4) divide step 3 by 3.0

root = (2.0*1.0) + 27/(1.0*1.0) / 3.0 ; (2) + 27/(1.0) / 3.0 = 11 ==> 根 = 11

EDIT2：新代码！！

.586
.MODEL FLAT
.STACK 4096

.DATA
root    REAL4   1.0
oldRoot REAL4   2.0
Two     REAL4   2.0
three   REAL4   3.0
xx      REAL4   27.0


.CODE
main    PROC
        finit           ; initialize FPU
                fld     root    ; root in ST    ; Pass 1 ST(0) has 1.0  
repreatAgain:
        ;fld    st(2)

        fmul    two     ; root*two      ; Pass 1 ST(0) has 2                                                                            Pass 2 ST(0) = 19.333333 st(1) = 3.0 st(2) = 29.0 st(3) = 1.0

        ; the answer to roor*two is stored in ST0 when we load rootSTO moves to ST1 and we will use ST0 for the next operation
        fld     root    ; root in ST(0) previous content is now in ST(1)      Pass 1 ST(0) has 1.0 ST(1) has 2.0                        Pass 2 st(
        fmul    st(0), st(0)    ; root*root                                 ; Pass 1 st(0) has 1.0 st(1) has 2.0
        fld     xx                                                          ; Pass 1 st(0) has 27.0 st(1) has 1.0 st(2) has 2.0
        fdiv    st(0), st(1) ; x / (root*root)  ; Pass 1: 27 / 1              Pass 1 st(0) has 27.0 st(1) has 2.0 st(2) has 2.0
        fadd    st(0), st(2) ; (2.0*root) + x / (root*root))                  Pass 1 st(0) has 29.0 st(1) has 1.0 st(2) has 2.0

        fld     three                                                       ; Pass 1 st(0) has 3.0 st(1) has 29.0 st(2) has 1.0 st(3) has 2.0

        fld     st(1)                                                       ; Pass 1 st(0) has 3.0 st(1) has 29.0 st(2) = 1.0 st(3) = 2.0
        fdiv    st(0), st(1) ; (2.0*root) + x / (root*root)) / 3.0            Pass 1 st(1) has 9.6666666666667



        jmp     repreatAgain
        mov     eax, 0  ; exit
        ret
main    ENDP 
END

Answer 1

Intel 的 insn 参考手册记录了所有说明，包括 fdiv and fdivr (x/y instead of y/x). If you really need to learn mostly-obsolete x87 (fdiv) instead of SSE2 (divss), then this x87 tutorial is essential reading, esp. the early chapter that explains the register stack. Also see this x87 FP comparison Q&A. See more links in the x86 标签 wiki。

回复：EDIT2 代码转储：

循环内有 4 条 fld 指令，但没有 p 后缀的操作。您的循环将在第 3 次迭代时溢出 8 寄存器 FP 堆栈，此时您将得到一个 NaN。（具体来说，不确定值 NaN，printf 打印为 1#IND.

我建议 设计你的循环，使迭代从 st(0) 中的 root 开始，并以 [=22] 中下一次迭代的 root 值结束=]。不要在循环内加载或存储 to/from root。使用 fld1 加载 1.0 作为循环外的初始值，并在循环后使用 fstp [root] 将 st(0) 弹出到内存中。

你选择了最不方便的方式来做 tmp/3.0

                          ; stack = tmp   (and should otherwise be empty once you fix the rest of your code)
    fld     three         ; stack = 3.0, tmp
    fld     st(1)         ; stack = tmp, 3.0, tmp   ; should have used fxchg to just swap instead of making the stack deeper
    fdiv    st(0), st(1)  ; stack = tmp/3.0, 3.0, tmp

fdiv、fsub等有多种寄存器-寄存器形式：一种是st(0)是目的地，另一种是源。以 st(0) 作为来源的表格也可用于 pop，因此您可以

    fld     three         ; stack = 3.0, tmp
    fdivp                 ; stack = tmp / 3.0  popping the stack back to just one entry
    ; fdivp  st(1), st(0) ; this is what fdivp with no operands means

其实比直接使用内存操作数而不是加载它更简单。既然你想要st(0) /= 3.0，你可以做到fdiv [three]。在那种情况下，FP 操作就像整数操作一样，您可以在其中执行 div dword ptr [integer_from_memory] 以使用内存源操作数。

非交换运算（减法和除法）也有相反的版本（例如 fdivr），它可以为您节省 fxchg 或让您使用内存操作数，即使您愿意需要 3.0/tmp 而不是 tmp/3.0

除以3等同于乘以1/3，fmul比fdiv快很多。从代码简单的角度来看，乘法是可交换的，所以另一种实现 st(0) /= 3 的方法是：

fld    [one_third]
fmulp                  ; shorthand for  fmulp st(1), st(0)

; or
fmul   [one_third]

请注意，1/3.0 在二进制浮点数中没有精确表示，但 +/- 大约 2^23 之间的所有整数都可以（单精度 REAL4 的尾数大小）。如果您希望使用三的精确倍数，您应该只关心这一点。

原代码评论：

您可以通过提前执行 2.0 / 3.0 和 x/3.0 将除法提升到循环之外。如果您希望循环平均运行超过一次迭代，那么这是值得的。

您可以使用 fld st(0) 复制堆栈的顶部，因此您不必一直从内存中加载。

fimul [root] (integer mul) 是一个错误：您的 root 是 REAL4（32 位浮点数）格式，而不是整数。 fidiv 同样是一个错误，当然不适用于 x87 寄存器作为源操作数。

由于堆栈顶部有 root，我认为您可以 fmul st(0) 将 st(0) 用作显式和隐式操作数，从而得到 st(0) = st(0) * st(0)，堆栈深度没有变化。

您也可以使用 sqrt 作为 比 1.0 或 +/-1 * sqrtf(fabsf(x)) 更好的初始近似值。我没有看到将一个浮点数的符号应用于另一个的 x87 指令，只是 fchs 无条件翻转，fabs 无条件清除符号位。有一个 fcmov，但需要 P6 或更高版本 CPU。您提到了 8086，但后来使用了 .586，所以我知道您的目标是什么。

更好的循环体：

未调试或测试，但您的代码充满了来自相同数据的重复加载，这让我抓狂。这个优化版本在这里是因为我很好奇，而不是因为我认为它会直接帮助 OP。

另外，希望这是一个很好的例子，说明如何在代码中注释棘手的数据流。（例如 x87，或带有随机播放的矢量化代码）。

## x/3.0 in st(1)
## 2.0/3.0 in st(2)

# before each iteration: st(0) = root
#  after each iteration: st(0) = root * 2.0/3.0 + (x/3.0 / (root*root)), with constants undisturbed

loop_body:
    fld     st(0)         ; stack: root, root, 2/3, x/3
    fmul    st(0), st(0)  ; stack: root^2, root, 2/3, x/3
    fdivr   st(0), st(3)  ; stack: x/3 / root^2, root, 2/3, x/3
    fxchg   st(1)         ; stack: root, x/3/root^2, 2/3, x/3
    fmul    st(0), st(2)  ; stack: root*2/3, x/3/root^2, 2/3, x/3
    faddp                 ; stack: root*2/3 + x/3/root^2, 2/3, x/3

; TODO: compare and loop back to loop_body

    fstp    [root]         ; store and pop
    fstp    st(0)          ; pop the two constants off the FP stack to empty it before returning
    fstp    st(0)
    ; finit is very slow, ~80cycles, don't use it if you don't have to.

32 位函数调用约定 return FP 结果为 st(0)，因此您可以这样做，但调用者可能必须存储在某处。

Answer 2

我将在非常基本的层面上为那些可能面临需要在 FPU 上完成的计算的 x87 新手回答这个问题。

有两件事需要考虑。如果给你一个计算（INFIX notation），比如：

root := (2.0*root + x/(root*root)) / 3.0

有没有办法将其转化为 x87 FPU 可以使用的基本指令？是的，在非常基本的层面上，x87 FPU 是一个堆栈，就像一个复杂的 RPN 计算器。代码中的等式是 INFIX 表示法。如果将其转换为 POSTFIX(RPN) 表示法，它可以很容易地实现为具有操作的堆栈。

此 document 提供了一些关于转换为 POSTFIX 表示法的信息。按照规则，您的 POSTFIX 等价物看起来像：

2.0 root * x root root * / + 3.0 /

您可以使用 root=1 和 x=27:

中的这些键将其直接放入旧的 RPN 计算器 (HP) 中，例如 HP 15C

2.0 [enter] root * x [enter] root [enter] root * / + 3.0 /

在线 HP 15C 应显示该计算结果为 9.667。将其转换为基本的 x87：

一个数字是压栈顶（fld）
变量是压栈顶（fld）
*是fmulp（ST(1)乘以ST(0)，结果存入ST(1)，弹出寄存器栈）
/为fdivp（ST(1)除以ST(0)，结果存入ST(1)，弹出寄存器栈）
+是faddp（ST(0)与ST(1)相加，结果存入ST(1)，弹出寄存器栈）
-是fsubp（ST(1)减ST(0)，结果存入ST(1)，弹出寄存器栈）

你可以直接将2.0 root * x root root * / + 3.0 /转换成x87指令：

fld Two      ; st(0)=2.0
fld root     ; st(0)=root, st(1)=2.0
fmulp        ; st(0)=(2.0 * root)
fld xx       ; st(0)=x, st(1)=(2.0 * root)
fld root     ; st(0)=root, st(1)=x, st(2)=(2.0 * root)
fld root     ; st(0)=root, st(1)=root, st(2)=x, st(3)=(2.0 * root)
fmulp        ; st(0)=(root * root), st(1)=x, st(2)=(2.0 * root)
fdivp        ; st(0)=(x / (root * root)), st(1)=(2.0 * root)
faddp        ; st(0)=(2.0 * root) + (x / (root * root))
fld Three    ; st(0)=3.0, st(1)=(2.0 * root) + (x / (root * root))
fdivp        ; st(0)=((2.0 * root) + (x / (root * root))) / 3.0

掌握了基础知识后，就可以继续提高效率了。

关于编辑 2 / 后续问题

要记住的一件事是，如果您不使用从堆栈中弹出值的指令，则循环的每次迭代都会消耗更多的 FPU 堆栈槽。通常，以 P 结尾的 FPU 指令将值弹出堆栈。您不使用任何指令从堆栈中删除项目，FPU 堆栈不断增长。

与用户 space 中的程序堆栈不同，FPU 堆栈非常有限，因为它只有 8 个槽。如果将超过 8 个有效值放入堆栈，则会出现 1#IND 形式的溢出错误。如果我们在每条指令后分析您的代码并查看堆栈，我们会发现：

    fld     root            ; st(0)=root  
repreatAgain:
    fmul    two             ; st(0)=(2.0*root)      
    fld     root            ; st(0)=root, st(1)=(2.0*root) 
    fmul    st(0), st(0)    ; st(0)=(root*root), st(1)=(2.0*root)
    fld     xx              ; st(0)=x, st(1)=(root*root), st(2)=(2.0*root)
    fdiv    st(0), st(1)    ; st(0)=(x/(root*root)), st(1)=(root*root), st(2)=(2.0*root)
    fadd    st(0), st(2)    ; st(0)=((2.0*root) + x/(root*root)), st(1)=(root*root), st(2)=(2.0*root)
    fld     three           ; st(0)=3.0, st(1)=((2.0*root) + x/(root*root)), st(2)=(root*root), st(3)=(2.0*root)                                            
    fld     st(1)           ; st(0)=((2.0*root) + x/(root*root)), st(1)=3.0, st(2)=((2.0*root) + x/(root*root)), st(3)=(root*root), st(4)=(2.0*root)
    fdiv    st(0), st(1)    ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0, st(2)=((2.0*root) + x/(root*root)), st(3)=(root*root), st(4)=(2.0*root)
    jmp     repreatAgain

观察在最后一个 FDIV 指令之后和 JMP 指令之前，我们在堆栈上有 5 个项目（st (0) 到 st(4))。当我们进入循环时，我们只有 1 个 root in st(0)。解决此问题的最佳方法是以这样一种方式使用指令，即随着计算的进行，值会从堆栈中弹出（移除）。

另一种效率较低的方法是在重复循环之前释放堆栈中不再需要的值。 FFREE 指令可用于此目的，方法是从堆栈底部开始手动标记未使用的条目。如果您在上面的代码之后添加这些行，并且在 jmp repreatAgain 之前，代码应该可以工作：

ffree   st(4)           ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0, st(2)=((2.0*root) + x/(root*root)), st(3)=(root*root)
ffree   st(3)           ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0, st(2)=((2.0*root) + x/(root*root))
ffree   st(2)           ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0
ffree   st(1)           ; st(0)=(((2.0*root) + x/(root*root))/3.0)
fst     root            ; Update root variable
jmp     repreatAgain

通过使用 FFREE 指令，我们在 st(0) 中仅使用新的 root 结束循环.

由于您的计算方式，我还添加了 fst root。您的计算包括 fld root，它依赖于每个循环完成时更新 root 中的值。有一种更有效的方法可以做到这一点，但我提供了一个无需大量修改即可在您当前代码中运行的修复程序。

如果您使用我之前提供的 inefficient/simple 代码片段进行计算，您最终会得到如下代码：

    finit        ; initialize FPU
repreatAgain:
    fld Two      ; st(0)=2.0
    fld root     ; st(0)=root, st(1)=2.0
    fmulp        ; st(0)=(2.0 * root)
    fld xx       ; st(0)=x, st(1)=(2.0 * root)
    fld root     ; st(0)=root, st(1)=x, st(2)=(2.0 * root)
    fld root     ; st(0)=root, st(1)=root, st(2)=x, st(3)=(2.0 * root)
    fmulp        ; st(0)=(root * root), st(1)=x, st(2)=(2.0 * root)
    fdivp        ; st(0)=(x / (root * root)), st(1)=(2.0 * root)
    faddp        ; st(0)=(2.0 * root) + (x / (root * root))
    fld Three    ; st(0)=3.0, st(1)=(2.0 * root) + (x / (root * root))
    fdivp        ; newroot = st(0)=((2.0 * root) + (x / (root * root))) / 3.0
    fstp root    ; Store result at top of stack into root and pop value
                 ;     at this point the stack is clear again since
                 ;     all items pushed have been popped.

    jmp repreatAgain

此代码不需要 FFREE，因为随着计算的进行，元素会从堆栈中弹出。指令FADDP、FSUBP、FDIVP、FADDP将另外弹出堆栈顶部的值。这具有使堆栈清除部分中间计算的副作用。

整合循环

要将循环集成到我之前创建的 simple/inefficient 代码中，您可以使用 FCOM (Floating point compare) 的变体进行比较。浮点比较的结果然后是 transferred/converted 到常规 CPU 标志 (EFLAGS)。然后可以使用常规比较运算符来执行条件检查。代码可能如下所示：

epsilon REAL4   0.001

.CODE
main PROC
    finit              ; initialize FPU

repeatAgain:
    fld Two            ; st(0)=2.0
    fld root           ; st(0)=root, st(1)=2.0
    fmulp              ; st(0)=(2.0 * root)
    fld xx             ; st(0)=x, st(1)=(2.0 * root)
    fld root           ; st(0)=root, st(1)=x, st(2)=(2.0 * root)
    fld root           ; st(0)=root, st(1)=root, st(2)=x, st(3)=(2.0 * root)
    fmulp              ; st(0)=(root * root), st(1)=x, st(2)=(2.0 * root)
    fdivp              ; st(0)=(x / (root * root)), st(1)=(2.0 * root)
    faddp              ; st(0)=(2.0 * root) + (x / (root * root))
    fld Three          ; st(0)=3.0, st(1)=(2.0 * root) + (x / (root * root))
    fdivp              ; newroot=st(0)=((2.0 * root) + (x / (root * root))) / 3.0
    fld root           ; st(0)=oldroot, st(1)=newroot
    fsub st(0), st(1)  ; st(0)=(oldroot-newroot), st(1)=newroot
    fabs               ; st(0)=(|oldroot-newroot|), st(1)=newroot
    fld epsilon        ; st(0)=0.001, st(1)=(|oldroot-newroot|), st(2)=newroot
    fcompp             ; Do compare&set x87 flags pop top two values off stack
                       ;     st(0)=newroot    
    fstsw ax           ; Copy x87 Status Word containing the result to AX
    fwait              ; Insure previous instruction completed
    sahf               ; Transfer condition codes to the CPU's flags register

    fstp root          ; Store result (newroot) at top of stack into root 
                       ;     and pop value. At this point the stack is clear
                       ;     again since all items pushed have been popped.
    jbe repeatAgain    ; If 0.001 <= (|oldroot-newroot|) repeat
    mov eax, 0         ; exit
    ret
main    ENDP 
END

注意：FCOMPP 的使用和 x87 标志到 CPU 标志的手动传输是由于您有 .586 指令在你的代码的顶部。我假设因为您没有指定 .686 或更高版本，所以像 FCOMI 这样的指令不可用。如果您使用的是 .686 或更高版本，那么代码的底部可能如下所示：

fld root           ; st(0)=oldroot, st(1)=newroot
fsub st(0), st(1)  ; st(0)=(oldroot-newroot), st(1)=newroot
fabs               ; st(0)=(|oldroot-newroot|), st(1)=newroot
fld epsilon        ; st(0)=0.001, st(1)=(|oldroot-newroot|), st(2)=newroot
fcomip st(0),st(1) ; Do compare & set CPU flags, pop one value off stack
                   ;     st(0)=(|oldroot-newroot|), st(1)=newroot
fstp st(0)         ; Pop temporary value off top of stack
                   ;     st(0)=newroot

fstp root          ; Store result (newroot) at top of stack into root 
                   ;     and pop value. At this point the stack is clear
                   ;     again since all items pushed have been popped.
jbe repeatAgain    ; If 0.001 <= (|oldroot-newroot|) repeat

从中缀表示法创建 RPN/Postfix 的快速方法

如果学习将 Infix 符号转换为 RPN/Postfix 从我之前在问题中链接的文档来看似乎有点令人生畏有一些缓解。有许多网站可以为您完成这项工作。 MathBlog 就是这样一个网站。只需输入您的等式，点击转换，它就会显示等价于 RPN/Postfix 的结果。仅限于+-/*，括号和指数带^.

优化

优化代码的一个重要关键是通过将每个循环之间保持不变的部分与可变部分分开来优化公式。常数部分可以在循环开始之前计算。

你原来的等式是这样的：

分离常量部分我们可以得到：

如果我们用 twothirds = 2.0/3.0 和 xover3 = x/3 中的标识符替换常量，那么我们最终会得到如下所示的简化等式：

如果我们将其转换为 POSTFIX/RPN 那么我们得到：

twothirds root * xover3 root root * / +

彼得在更好的循环体部分的回答中利用了类似的优化。他将常量 Twothirds 和 Xover3 放在循环外的 x87 FPU 堆栈上，并在循环内根据需要引用它们。这避免了每次循环都不必要地从内存中重新读取它们。

基于上述优化的更完整示例：

.586
.MODEL FLAT
.STACK 4096

.DATA
xx        REAL4   27.0
root      REAL4   1.0
Three     REAL4   3.0
epsilon   REAL4   0.001
Twothirds REAL4 0.6666666666666666

.CODE
main PROC
    finit               ; Initialize FPU
    fld epsilon         ; st(0)=epsilon
    fld root            ; st(0)=prevroot (Copy of root), st(1)=epsilon
    fld TwoThirds       ; st(0)=(2/3), st(1)=prevroot, st(2)=epsilon 
    fld xx              ; st(0)=x, st(1)=(2/3), st(2)=prevroot, st(3)=epsilon
    fdiv Three          ; st(0)=(x/3), st(1)=(2/3), st(2)=prevroot, st(3)=epsilon
    fld st(2)           ; st(0)=root, st(1)=(x/3), st(2)=(2/3), st(3)=prevroot, st(4)=epsilon

repeatAgain:

    ; twothirds root * xover3 root root * / +
    fld st(0)           ; st(0)=root, st(1)=root, st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
    fmul st(0), st(3)   ; st(0)=(2/3*root), st(1)=root, st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon           
    fxch                ; st(0)=root, st(1)=(2/3*root), st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
    fmul st(0), st(0)   ; st(0)=(root*root), st(1)=(2/3*root), st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
    fdivr st(0), st(2)  ; st(0)=((x/3)/(root*root)), st(1)=(2/3*root), st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
    faddp               ; st(0)=((2/3*root)+(x/3)/(root*root)), st(1)=(x/3), st(2)=(2/3), st(3)=prevroot, st(4)=epsilon
    fxch st(3)          ; st(0)=prevroot, st(1)=(x/3), st(2)=(2/3), newroot=st(3)=((2/3*root)+(x/3)/(root*root)), st(4)=epsilon 
    fsub st(0), st(3)   ; st(0)=(prevroot-newroot), st(1)=(x/3), st(2)=(2/3), st(3)=newroot, st(4)=epsilon
    fabs                ; st(0)=(|prevroot-newroot|), st(1)=(x/3), st(2)=(2/3), st(3)=newroot, st(4)=epsilon
    fld st(4)           ; st(0)=0.001, st(1)=(|prevroot-newroot|), st(2)=(x/3), st(3)=(2/3), st(4)=newroot, st(5)=epsilon

    fcompp              ; Do compare&set x87 flags pop top two values off stack
                        ;     st(0)=(x/3), st(1)=(2/3), st(2)=newroot, st(3)=epsilon    
    fstsw ax            ; Copy x87 Status Word containing the result to AX
    fwait               ; Insure previous instruction completed
    sahf                ; Transfer condition codes to the CPU's flags register

    fld st(2)           ; st(0)=newroot, st(1)=(x/3), st(2)=(2/3), st(3)=newroot, st(4)=epsilon
    jbe repeatAgain     ; If 0.001 <= (|oldroot-newroot|) repeat

    ; Remove temporary values on stack, cubed root in st(0)
    ffree st(4)         ; st(0)=newroot, st(1)=(x/3), st(2)=(2/3), st(3)=epsilon
    ffree st(3)         ; st(0)=newroot, st(1)=(x/3), st(2)=(2/3)
    ffree st(2)         ; st(0)=newroot, st(1)=(x/3)
    ffree st(1)         ; st(0)=newroot

    mov     eax, 0  ; exit
    ret
main ENDP 

END

此代码在进入循环之前将这些值放在堆栈上：

st(4) = Epsilon 值 (0.001)
st(3) = 计算完成前 root 的副本（有效 prevroot）
st(2) = 常数 Twothirds (2/3)
st(1) = Xover3 (x/3)
st(0) = root

在循环重复之前，堆栈将具有上面的布局。

退出前最后的代码删除了所有临时值，并简单地在 st(0) 的顶部留下值 root 的堆栈。

使用 Newton-Raphson 方法在 x87 FPU 上立方根

Cube root on x87 FPU using Newton-Raphson method

x86

assembly

masm

newtons-method

x87

原代码评论：

更好的循环体：

关于编辑 2 / 后续问题

整合循环

从中缀表示法创建 RPN/Postfix 的快速方法

优化