用微控制器提取根

Extracting root with a microcontroller

在书上看到了用单片机解root的方法。我无法理解这个过程。为什么在下面的代码中 p=0x80?在这种情况下,如果我们有一个小数字,它会做很多无用的循环,不是吗?

unsigned int math_sqrt(unsigned int x)
{ 
  unsigned char ans = 0,p = 0x80;
  while(p!=0){
     ans+=p;
     if(ans*ans>x)
       ans-=p;
     p = (unsigned char)(p/2);
  }
  return ans;
}

我是微控制器和C的新手,抱歉,我的英语表达可能有些问题。顺便问一下,我可以在stm32或K60之类的微控制器中使用math.h吗?谢谢

代码的准确性

Maybe the numbers I used to test are small. What order of magnitude do you think the code can show its advantage?

它适用于 0..65535 范围内的输入。

这里有一些简单的测试代码,您可以 运行 — 好吧,如果您有我的 isqrt_32() 函数,也可以 运行,但是把它砍掉并不难仅使用标准 C 库 sqrt() 函数。

#include <math.h>
#include <stdio.h>
#include "isqrt.h"

static unsigned int math_sqrt(unsigned int x)
{ 
  unsigned char ans = 0,p = 0x80;
  while(p!=0){
     ans+=p;
     if(ans*ans>x)
       ans-=p;
     p = (unsigned char)(p/2);
  }
  return ans;
}

int main(void)
{
    for (int i = 0; i < 70000; i++)
    {
        int v1 = isqrt_32(i);
        int v2 = math_sqrt(i);
        double d1 = sqrt(i);
        int v3 = d1;
        if (v1 != v2 || v1 != v3)
            printf("%5d: %5d vs %5d vs %5d (%9.6f)\n", i, v1, v2, v3, d1);
    }
    return 0;
}

isqrt_32()的签名是:extern int32_t isqrt_32(uint32_t x);

当运行时开始输出:

65536:   256 vs   255 vs   256 (256.000000)
65537:   256 vs   255 vs   256 (256.001953)
65538:   256 vs   255 vs   256 (256.003906)
65539:   256 vs   255 vs   256 (256.005859)
65540:   256 vs   255 vs   256 (256.007812)
65541:   256 vs   255 vs   256 (256.009765)
65542:   256 vs   255 vs   256 (256.011718)
65543:   256 vs   255 vs   256 (256.013672)
65544:   256 vs   255 vs   256 (256.015625)
65545:   256 vs   255 vs   256 (256.017578)

问题中显示的代码与 16 位无符号整数平方根的速度一样快,前提是输入范围不受 16 位无符号整数的限制。

isqrt_32()

中的内容

我去看了isqrt_32()背后的代码,想起了过去的时光。我使用的代码基于 Chris Torek 不晚于 1991 年在 comp.lang.c 新闻组中发布的一些 public 域代码(我从 1991-08-14 中获取了一个重新发布的副本)。

关键样本计时结果是(2011-11-27 on MacOS X 10.7.2 on a 2.3 GHz Intel Core i7 with 8 GB 1333 MHz DDR3 memory):

CT 0.482771 (1370731968); JL 2.074758 (1370731968); MT 0.125003 (1507082656)
CT 0.404681 (1370731968); JL 1.825984 (1370731968); MT 0.086421 (1507082656)
CT 0.509388 (1370731968); JL 1.900582 (1370731968); MT 0.094746 (1507082656)
CT 0.534594 (1370731968); JL 1.866929 (1370731968); MT 0.095474 (1507082656)
CT 0.447996 (1370731968); JL 2.211031 (1370731968); MT 0.120391 (1507082656)

2017 年,我使用更新版本的代码进行了测试,该代码还测试了:

int32_t isqrt_db(uint32_t x)
{
    return sqrt(x);
}

结果表明,在我的测试平台上,使用库函数明显快于使用整数函数(2017-12-23 on macOS 10.13.2 on a 2.7 GHz Intel Core i7 with 16 GiB 2133 Mhz LPDDR3记忆):

CT 0.217224 (1370731968); JL 1.479817 (1370731968); DB 0.125086 (1370731968); MT 0.086683 (1507082656)
CT 0.211790 (1370731968); JL 1.478602 (1370731968); DB 0.129983 (1370731968); MT 0.085262 (1507082656)
CT 0.213528 (1370731968); JL 1.478100 (1370731968); DB 0.124773 (1370731968); MT 0.082611 (1507082656)
CT 0.212352 (1370731968); JL 1.494118 (1370731968); DB 0.130582 (1370731968); MT 0.085290 (1507082656)
CT 0.211442 (1370731968); JL 1.499416 (1370731968); DB 0.127462 (1370731968); MT 0.082682 (1507082656)

2011 年,Chris Torek (CT) 时间约为 JL 时间的 1/4(至 1/5)。 2017年,差距更大,接近1/7。不过2017年系统库sqrt()几乎是CT代码的两倍

MT 时间是一个空的 (em-tee) 平方根函数,它只是 returns 它的输入。它是函数调用开销的衡量标准——它并不重要,也很容易避免。

不过,那是一台有硬件浮点的机器(芯片有浮点平方根指令)。如果微控制器没有浮点单元,或者如果浮点单元没有平方根指令,您可能会发现整数软件版本(明显——可能非常明显)比浮点代码快。

JFTR:2017年的汇编是:

gcc -m64 -I/Users/jleffler/inc -I. -DHAVE_CONFIG_H -DJLSS_STDERR -O3 -fPIC -g -std=c11 -pedantic -Wall -Wextra -Werror -Wshadow -Wmissing-prototypes -Wpointer-arith -Wwrite-strings -Wold-style-definition -Wcast-qual -Wstrict-prototypes -c isqrt.c

非常繁琐,但重点是-O3优化;这是优化的代码。测试代码打印累积总和,部分是为了检查一致性,部分是为了确保优化器不会通过发现相同的事情被执行 10,000 次来消除代码。也就是说,测试代码中的函数指针也减少了优化器过度滥用的机会。 2017-12-23 之前的代码版本没有使用指向函数测试的指针,更容易受到极端优化的影响。

大部分代码都可以在我的 SOQ(Stack Overflow Questions)存储库中的 GitHub 上找到,作为文件在 src/libsoq 子目录。文件 isqrt32.c 基于旧版本的代码 如下所示;文件 isqrt64.c 有点相似。我需要做一些更新。

TL;DR — 性能重要时的基准

  • 这再次证明了在性能至关重要时对机器进行基准测试的重要性。

此代码的许可证是 CC-by-SA 3.0 with attribution required,与 SO 的其余部分一样 — 请参阅 SO 上任何页面的底部。

/*
@(#)File:           $RCSfile: isqrt.c,v $
@(#)Version:        $Revision: 1.14 $
@(#)Last changed:   $Date: 2017/12/23 17:12:52 $
@(#)Purpose:        Integer square root calculation
@(#)Author:         J Leffler
@(#)Copyright:      (C) JLSS 1991-2017
*/

/*TABSTOP=4*/

#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
extern const char jlss_id_isqrt_c[];
const char jlss_id_isqrt_c[] = "@(#)$Id: isqrt.c,v 1.14 2017/12/23 17:12:52 jleffler Exp $";
#endif /* lint */

/* Configuration: use public domain implementation unless -DUSE_JLSS_ISQRT */
#undef USE_PUBLIC_DOMAIN
#ifndef USE_JLSS_ISQRT
#define USE_PUBLIC_DOMAIN
#endif /* USE_JLSS_ISQRT */

#include "isqrt.h"

#if defined(USE_TIMING_TESTS)
extern int32_t isqrt_jl(uint32_t x);
extern int32_t isqrt_ct(uint32_t x);
#else
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-macros"
#define isqrt_jl(x) isqrt_32(x)
#define isqrt_ct(x) isqrt_32(x)
#pragma GCC diagnostic pop
#endif /* USE_TIMING_TESTS */

#if !defined(USE_PUBLIC_DOMAIN) || defined(USE_TIMING_TESTS)
/*
** Algorithm by J Leffler: slow but correct
*/
int32_t isqrt_jl(uint32_t x)
{
    uint32_t x1;
    uint32_t x2;

    /* Special cases:              */
    /* x == 0 => return 0 CORRECT */
    /* x <  4 => return 1 CORRECT */
    if (x < 4)
        return(x > 0);

    x1 = x / 4;
    while ((x2 = ((x / x1 + x1) / 2)) != x1)
    {
        if (x2 == x1 + 1 && x2 > x / x2 && x1 < x / x1)
            return(x1);
        x1 = x2;
    }
    return(x1);
}
#endif /* !USE_PUBLIC_DOMAIN || USE_TIMING_TESTS */

#if defined(USE_PUBLIC_DOMAIN) || defined(USE_TIMING_TESTS)
/*
** Code by Chris Torek: fast and correct
*/
/*
** From: edelsohn@sccs.syr.edu (David Edelsohn)
** Subject: Re: quick sqrt()
** Message-ID: <1991Aug14.161849.18548@rodan.acs.syr.edu>
** Date: 14 Aug 91 20:18:49 GMT
** Organization: Syracuse Center for Computational Science/Dept of Physics
**
** I tried replying to the poster but my email bounced and Chris Torek
** has not jumped in with his previous solution, so I will repost it for him:
** >From: chris@mimsy.umd.edu (Chris Torek)
** >Subject: Re: Integer Square Root (Was Re: # to the nth power)
**
** Integer square root routine, good for up to 32-bit values.
** Note that the largest square root (that of 0xffffffff)is
** 0xffff, so the result fits in a regular unsigned and need
** not be `long'.
**
** Original code from Tomas Rokicki (using a well known algorithm).
** This version by Chris Torek, University of Maryland.
**
** This code is in the public domain.
*/
int32_t isqrt_ct(uint32_t v)
{
    uint32_t t = 1L << 30;
    uint32_t r = 0;
    uint32_t s;

#undef STEP
#define STEP(k) \
    s = t + r; \
    r >>= 1; \
    if (s <= v) { \
        v -= s; \
        r |= t; \
    }

    STEP(15);
    t >>= 2;
    STEP(14);
    t >>= 2;
    STEP(13);
    t >>= 2;
    STEP(12);
    t >>= 2;
    STEP(11);
    t >>= 2;
    STEP(10);
    t >>= 2;
    STEP(9);
    t >>= 2;
    STEP(8);
    t >>= 2;
    STEP(7);
    t >>= 2;
    STEP(6);
    t >>= 2;
    STEP(5);
    t >>= 2;
    STEP(4);
    t >>= 2;
    STEP(3);
    t >>= 2;
    STEP(2);
    t >>= 2;
    STEP(1);
    t >>= 2;
    STEP(0);
    return (int32_t)r;
}
#endif /* USE_PUBLIC_DOMAIN || USE_TIMING_TESTS */

#if defined(TEST)
#include <stdio.h>
#include <math.h>
#include <inttypes.h>

#if defined(USE_TIMING_TESTS)
/*
** Representative timings - (2011-11-27 on MacOS X 10.7.2 on a 2.3 GHz Intel Core i7
** with 8 GB 1333 MHz DDR3 memory).
**
** CT 0.482771 (1370731968); JL 2.074758 (1370731968); MT 0.125003 (1507082656)
** CT 0.404681 (1370731968); JL 1.825984 (1370731968); MT 0.086421 (1507082656)
** CT 0.509388 (1370731968); JL 1.900582 (1370731968); MT 0.094746 (1507082656)
** CT 0.534594 (1370731968); JL 1.866929 (1370731968); MT 0.095474 (1507082656)
** CT 0.447996 (1370731968); JL 2.211031 (1370731968); MT 0.120391 (1507082656)
**
** With the augmented test timing the conversion to double, use library
** square root, and convert back to integer too, the results (2017-12-23
** on macOS 10.13.2 on a 2.7 GHz Intel Core i7 with 16 GiB 2133 Mhz
** LPDDR3 memory) were:
**
** CT 0.217224 (1370731968); JL 1.479817 (1370731968); DB 0.125086 (1370731968); MT 0.086683 (1507082656)
** CT 0.211790 (1370731968); JL 1.478602 (1370731968); DB 0.129983 (1370731968); MT 0.085262 (1507082656)
** CT 0.213528 (1370731968); JL 1.478100 (1370731968); DB 0.124773 (1370731968); MT 0.082611 (1507082656)
** CT 0.212352 (1370731968); JL 1.494118 (1370731968); DB 0.130582 (1370731968); MT 0.085290 (1507082656)
** CT 0.211442 (1370731968); JL 1.499416 (1370731968); DB 0.127462 (1370731968); MT 0.082682 (1507082656)
**
** Note that on a machine with hardware floating point, the system
** library is close to twice as fast as the CT algorithm, which is about
** 7 times as fast as the JL algorithm.  If the CPU doesn't have
** hardware floating point, then the integer algorithms will be far
** faster than simulated floating point.  This does demonstrate the
** importance of benchmarking, though.
*/

#include "timer.h"

enum { ITERATIONS = 100000 };

static inline uint32_t next_x(uint32_t x)
{
    return (234ULL * x) / 213ULL + 1;
}

extern int32_t isqrt_mt(uint32_t x);
extern int32_t isqrt_db(uint32_t x);

/* Empty isqrt function - measure function call/return overhead */
int32_t isqrt_mt(uint32_t x)
{
    return x;
}

/* Double isqrt function - convert to double, use standard sqrt function, convert back */
int32_t isqrt_db(uint32_t x)
{
    /* Shorthand for: double d = x; d = sqrt(d); x = d; return x; */
    return sqrt(x);
}

typedef int32_t (*Sqrt_Fn)(uint32_t x);

typedef struct Time_Test
{
    Sqrt_Fn     fn;
    const char *prefix;
    const char *tag;
} Time_Test;

static const Time_Test tests[] =
{
    {   isqrt_ct,   "", "CT" },     /* Public domain (Chris Torek) algorithm */
    {   isqrt_jl, "; ", "JL" },     /* JLSS algorithm */
    {   isqrt_db, "; ", "DB" },     /* Double algorithm */
    {   isqrt_mt, "; ", "MT" },     /* Empty algorithm */
};
enum { NUM_TESTS = sizeof(tests) / sizeof(tests[0]) };

static void time_isqrt_function(Sqrt_Fn fn, Clock *clk, uint32_t *acc)
{
    clk_start(clk);
    for (uint32_t i = 0; i < ITERATIONS; i++)
    {
        for (uint32_t x = 0; x < UINT32_MAX / 2; x = next_x(x))
        {
            int32_t v = (*fn)(x);
            *acc += v + i;
        }
    }
    clk_stop(clk);
}

static void time_test(const Time_Test *test)
{
    Clock clk;
    uint32_t acc = 0;
    char buffer[32];
    time_isqrt_function(test->fn, &clk, &acc);
    printf("%s%s %8s (%" PRIu32 ")", test->prefix, test->tag,
           clk_elapsed_us(&clk, buffer, sizeof(buffer)), acc);
}

int main(void)
{
    for (int i = 0; i < NUM_TESTS; i++)
        time_test(&tests[i]);
    putchar('\n');
    return 0;
}

#elif defined(USE_PUBLIC_DOMAIN)
int main(void)
{
    uint32_t l;
    char            buf[100];

    for (;;)
    {
        (void)printf("gimme a number> ");
        if (fgets(buf, sizeof buf, stdin) == NULL)
            break;
        /* should use strtoul here but some do not have it */
        if (sscanf(buf, "0x%" SCNx32, &l) != 1 &&
            sscanf(buf, "0%"  SCNo32, &l) != 1 &&
            sscanf(buf, "%"   SCNu32, &l) != 1 &&
            sscanf(buf, "%"   SCNx32, &l) != 1)
            (void)printf("that was not a number\n");
        else
            (void)printf("root(%" PRIu32 ") => %" PRIu32 " sqrt(%" PRIu32 ") => %.17g\n",
                          l, isqrt_32(l), l, sqrt((double)l));
    }
    putchar('\n');
    return(0);
}

#else

static const uint32_t tests[] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 24, 25, 9999, 10000,
    4294705155, 4294705156, 4294705157, 0xFFFFFFFE, 0xFFFFFFFF,
};

#define DIM(x)  (sizeof(x)/sizeof(*(x)))

int main(void)
{
    int             i;

    for (i = 0; i < DIM(tests); i++)
    {
        uint32_t r = isqrt_32(tests[i]);
        assert(r * r <= tests[i]);
        // assert((r+1) * (r+1) > tests[i]);
        // But avoid overflows!
        assert(tests[i] / (r + 1) < (r + 1));
        printf("ISQRT(%" PRIu32 ") = %" PRId32 "\n", tests[i], r);
    }
    return(0);
}

#endif /* USE_PUBLIC_DOMAIN */
#endif /* TEST */

只是对 Jonathan 出色回答的数学补充。

算法之外的基本原理是 uint16_t 的平方根是 uint8_t,这意味着 0 到 255 之间的数字。因为函数平方和平方根都是单调的,您可以使用二分法找到最接近的解决方案。您只需从可能间隔的中间开始,即 128 或 0x80。然后你每次迭代一半大小的间隔,这允许很快(最多 8 个阶段)到大小 1 的 interval:解决方案。