floating-point/integer 类型转换的可靠溢出检测
Reliable overflow detection of floating-point/integer type conversion
是否有安全的方法来可靠地确定整数类型 T
是否可以存储浮点整数值 f
(因此 f == floor(f)
)而不会发生任何溢出?
请记住,不能保证浮点类型 F
与 IEC 559 (IEEE 754) 兼容,并且有符号整数溢出是 未定义行为在 C++ 中。我对根据当前 C++(写作时为 C++17)标准正确并避免 未定义行为.
的解决方案感兴趣
以下天真的方法并不可靠,因为无法保证类型 F
可以表示 std::numeric_limits<I>::max()
由于浮点舍入。
#include <cmath>
#include <limits>
#include <type_traits>
template <typename I, typename F>
bool is_safe_conversion(F x)
{
static_assert(std::is_floating_point_v<F>);
static_assert(std::is_integral_v<I>);
// 'fmax' may have a different value than expected
static constexpr F fmax = static_cast<F>(std::numeric_limits<I>::max());
return std::abs(x) <= fmax; // this test may gives incorrect results
}
有什么想法吗?
你能不能不做
static_cast<F>(static_cast<I>(x)) == floor(x)
?
Any idea?
template <typename I, typename F>
constexpr F maxConvertible()
{
I i = std::numeric_limits<I>::max();
F f = F(i);
while(F(i) == f)
{
--i;
}
return F(i);
}
由于四舍五入,我们可能得到了一个太大的最大值,现在递减计数直到我们得到下一个更小的可表示双精度数,这应该适合积分...
问题悬而未决:如果转换为 double 涉及四舍五入,这工作正常;然而,即使 IEEE 754 也允许不同的舍入模式(如果应用舍入到最接近的舍入,这应该是当前硬件中最常见的舍入模式,向上舍入将始终发生......)。
我还没有找到安全检测向下舍入的解决方案(稍后可能会添加;至少检测 "rounding to nearest" 已经有解决方案 here),如果发生这种情况,我们会得到一些负面信息如果在积分值的最大值和最小值附近出现错误,您可能会考虑 "acceptable" 对于那些实际上进行向下舍入的少数异国情调的体系结构。
独立于向上舍入或向下舍入,有符号整数无论如何都有一个特殊情况:如果整数以二进制补码表示并且位数多于浮点值的尾数,则类型最小value 将表示为浮点值,而一些更大的值则不会。抓住这种情况需要特殊处理。
Is there a safe way to reliably determine if an integral type T can store a floating-point integer value f?
是的。关键是使用浮点数学测试 f
是否在 T::MIN - 0.999...
到 T::MAX + 0.999...
范围内——没有舍入问题。奖励:舍入模式不适用。
有 3 种失败路径:太大、太小、不是数字。
The below assumes int/double
. I'll leave the C++ template forming for OP.
使用浮点数学精确地形成精确的 T::MAX + 1
很容易,因为 INT_MAX
是 Mersenne Number。 (我们在这里不是在谈论 Mersenne Prime。)
代码利用:
梅森数 除以 2 的整数数学也是 梅森数.
整数类型的2次幂常量到浮点类型的转换可以确定精确。
#define DBL_INT_MAXP1 (2.0*(INT_MAX/2+1))
// Below needed when -INT_MAX == INT_MIN
#define DBL_INT_MINM1 (2.0*(INT_MIN/2-1))
形成精确的T::MIN - 1
很难,因为它的绝对值通常是2的幂+1,而且整数类型和FP类型的相对精度是不确定的。相反,代码可以减去 2 的精确幂并与 -1 进行比较。
int double_to_int(double x) {
if (x < DBL_INT_MAXP1) {
#if -INT_MAX == INT_MIN
// rare non-2's complement machine
if (x > DBL_INT_MINM1) {
return (int) x;
}
#else
if (x - INT_MIN > -1.0) {
return (int) x;
}
#endif
Handle_Underflow();
} else if (x > 0) {
Handle_Overflow();
} else {
Handle_NaN();
}
}
关于非二进制基数的浮点类型 (FLT_RADIX != 2
)
使用 FLT_RADIX = 4, 8, 16 ...
,转换也会很准确。使用 FLT_RADIX == 10
,代码至少精确到 34 位 int
,因为 double
必须精确编码 +/-10^10。所以说 FLT_RADIX == 10
、64 位 int
机器的问题 - 风险很低。根据记忆,最后一次 FLT_RADIX == 10
投入生产是在十多年前。
整数类型始终编码为 2 的补码(最常见)、1 的补码或符号大小。 INT_MAX
始终是 2 减 1 的幂。 INT_MIN
始终是 - 2 的幂或 1 以上。实际上,始终以 2 为基数。
此方法使用 C(不是 C++,请参阅第一条评论)标准中的浮点格式定义。知道有效数字的位数(由 numeric_limits::digits
提供)和指数限制(由 numeric_limits::max_exponent
提供)允许我们准备精确值作为端点。
我相信它可以在所有符合标准的 C++ 实现中工作,但要遵守最初评论中规定的适度附加要求。它支持带或不带无穷大的浮点格式,范围比目标整数格式更宽或更窄,以及任何舍入规则(因为它只使用具有完全可表示结果的浮点算法,所以永远不需要舍入)。
/* This code demonstrates safe conversion of floating-point to integer in
which the input floating-point value is converted to integer if and only if
it is in the supported domain for such conversions (the open interval
(Min-1, Max+1), where Min and Max are the mininum and maximum values
representable in the integer type). If the input is not in range, an error
throw and no conversion is performed. This throw can be replaced by any
desired error-indication mechanism so that all behavior is defined.
There are a few requirements not fully covered by the C++ standard. They
should be uncontroversial and supported by all reasonable C++
implementations:
The floating-point format is as described in C 2011 5.2.4.2.2 (modeled
by the product of a sign, a number of digits in some base b, and base b
raised to an exponent). I do not see this explicitly specified in the
C++ standard, but it is implied by the characteristics specified in
std::numeric_limits. (For example, C++ requires numeric_limits to
provide the number of base-b digits in the floating-point
representation, where b is the radix used, which means the
representation must have base-b digits.)
The following operations are exact in floating-point. (All of them
are elementary operations and have mathematical results that are
exactly representable, so there is no need for rounding, and hence
exact results are expected in any sane implementation.)
Dividing by the radix of the floating-point format, within its
range.
Multiplying by +1 or -1.
Adding or subtracting two values whose sum or difference is
representable.
std::numeric_limits<FPType>::min_exponent is not greater than
-std::numeric_limits<FPType>::digits. (The code can be modified to
eliminate this requirement.)
*/
#include <iostream> // Not needed except for demonstration.
#include <limits>
/* Define a class to support safe floating-point to integer conversions.
This sample code throws an exception when a source floating-point value is
not in the domain for which a correct integer result can be produced, but
the throw can be replaced with any desired code, such as returning an error
indication in an auxiliary object. (For example, one could return a pair
consisting of a success/error status and the destination value, if
successful.)
FPType is the source floating-point type.
IType is the destination integer type.
*/
template<typename FPType, typename IType> class FPToInteger
{
private:
/* Wrap the bounds we need in a static object so it can be easily
initialized just once for the entire program.
*/
static class StaticData
{
private:
/* This function helps us find the FPType values just inside the
interval (Min-1, Max+1), where Min and Max are the mininum and
maximum values representable in the integer type).
It returns the FPType of the same sign of x+s that has the greatest
magnitude less than x+s, where s is -1 or +1 according to whether x
is non-positive or positive.
*/
static FPType BiggestFPType(IType x)
{
/* All references to "digits" in this routine refer to digits in
base std::numeric_limits<FPType>::radix. For example, in base
3, 77 would have four digits (2212). Zero is considered to
have zero digits.
In this routine, "bigger" and "smaller" refer to magnitude. (3
is greater than -4, but -4 is bigger than 3.) */
// Abbreviate std::numeric_limits<FPType>::radix.
const int Radix = std::numeric_limits<FPType>::radix;
// Determine the sign.
int s = 0 < x ? +1 : -1;
// Count how many digits x has.
IType digits = 0;
for (IType t = x; t; ++digits)
t /= Radix;
/* If the FPType type cannot represent finite numbers this big,
return the biggest finite number it can hold, with the desired
sign.
*/
if (std::numeric_limits<FPType>::max_exponent < digits)
return s * std::numeric_limits<FPType>::max();
// Determine whether x is exactly representable in FPType.
if (std::numeric_limits<FPType>::digits < digits)
{
/* x is not representable, so we will return the next lower
representable value by removing just as many low digits as
necessary. Note that x+s might be representable, but we
want to return the biggest FPType less than it, which, in
this case, is also the biggest FPType less than x.
*/
/* Figure out how many digits we have to remove to leave at
most std::numeric_limits<FPType>::digits digits.
*/
digits = digits - std::numeric_limits<FPType>::digits;
// Calculate Radix to the power of digits.
IType t = 1;
while (digits--) t *= Radix;
return x / t * t;
}
else
{
/* x is representable. To return the biggest FPType smaller
than x+s, we will fill the remaining digits with Radix-1.
*/
// Figure out how many additional digits FPType can hold.
digits = std::numeric_limits<FPType>::digits - digits;
/* Put a 1 in the lowest available digit, then subtract from 1
to set each digit to Radix-1. (For example, 1 - .001 =
.999.)
*/
FPType t = 1;
while (digits--) t /= Radix;
t = 1-t;
// Return the biggest FPType smaller than x+s.
return x + s*t;
}
}
public:
/* These values will be initialized to the greatest FPType value less
than std::numeric_limits<IType>::max()+1 and the least FPType value
greater than std::numeric_limits<IType>::min()-1.
*/
const FPType UpperBound, LowerBound;
// Constructor to initialize supporting data for FPTypeToInteger.
StaticData()
: UpperBound(BiggestFPType(std::numeric_limits<IType>::max())),
LowerBound(BiggestFPType(std::numeric_limits<IType>::min()))
{
// Show values, just for illustration.
std::cout.precision(99);
std::cout << "UpperBound = " << UpperBound << ".\n";
std::cout << "LowerBound = " << LowerBound << ".\n";
}
} Data;
public:
FPType value;
// Constructor. Just remember the source value.
FPToInteger(FPType x) : value(x) {}
/* Perform the conversion. If the conversion is defined, return the
converted value. Otherwise, throw an exception.
*/
operator IType()
{
if (Data.LowerBound <= value && value <= Data.UpperBound)
return value;
else
throw "Error, source floating-point value is out of range.";
}
};
template<typename FPType, typename IType>
typename FPToInteger<FPType, IType>::StaticData
FPToInteger<FPType, IType>::Data;
typedef double FPType;
typedef int IType;
// Show what the class does with a requested value.
static void Test(FPType x)
{
try
{
IType y = FPToInteger<FPType, IType>(x);
std::cout << x << " -> " << y << ".\n";
}
catch (...)
{
std::cout << x << " is not in the domain.\n";
}
}
#include <cmath>
int main(void)
{
std::cout.precision(99);
// Simple demonstration (not robust testing).
Test(0);
Test(0x1p31);
Test(std::nexttoward(0x1p31, 0));
Test(-0x1p31-1);
Test(std::nexttoward(-0x1p31-1, 0));
}
是否有安全的方法来可靠地确定整数类型 T
是否可以存储浮点整数值 f
(因此 f == floor(f)
)而不会发生任何溢出?
请记住,不能保证浮点类型 F
与 IEC 559 (IEEE 754) 兼容,并且有符号整数溢出是 未定义行为在 C++ 中。我对根据当前 C++(写作时为 C++17)标准正确并避免 未定义行为.
以下天真的方法并不可靠,因为无法保证类型 F
可以表示 std::numeric_limits<I>::max()
由于浮点舍入。
#include <cmath>
#include <limits>
#include <type_traits>
template <typename I, typename F>
bool is_safe_conversion(F x)
{
static_assert(std::is_floating_point_v<F>);
static_assert(std::is_integral_v<I>);
// 'fmax' may have a different value than expected
static constexpr F fmax = static_cast<F>(std::numeric_limits<I>::max());
return std::abs(x) <= fmax; // this test may gives incorrect results
}
有什么想法吗?
你能不能不做
static_cast<F>(static_cast<I>(x)) == floor(x)
?
Any idea?
template <typename I, typename F>
constexpr F maxConvertible()
{
I i = std::numeric_limits<I>::max();
F f = F(i);
while(F(i) == f)
{
--i;
}
return F(i);
}
由于四舍五入,我们可能得到了一个太大的最大值,现在递减计数直到我们得到下一个更小的可表示双精度数,这应该适合积分...
问题悬而未决:如果转换为 double 涉及四舍五入,这工作正常;然而,即使 IEEE 754 也允许不同的舍入模式(如果应用舍入到最接近的舍入,这应该是当前硬件中最常见的舍入模式,向上舍入将始终发生......)。
我还没有找到安全检测向下舍入的解决方案(稍后可能会添加;至少检测 "rounding to nearest" 已经有解决方案 here),如果发生这种情况,我们会得到一些负面信息如果在积分值的最大值和最小值附近出现错误,您可能会考虑 "acceptable" 对于那些实际上进行向下舍入的少数异国情调的体系结构。
独立于向上舍入或向下舍入,有符号整数无论如何都有一个特殊情况:如果整数以二进制补码表示并且位数多于浮点值的尾数,则类型最小value 将表示为浮点值,而一些更大的值则不会。抓住这种情况需要特殊处理。
Is there a safe way to reliably determine if an integral type T can store a floating-point integer value f?
是的。关键是使用浮点数学测试 f
是否在 T::MIN - 0.999...
到 T::MAX + 0.999...
范围内——没有舍入问题。奖励:舍入模式不适用。
有 3 种失败路径:太大、太小、不是数字。
The below assumes
int/double
. I'll leave the C++ template forming for OP.
使用浮点数学精确地形成精确的 T::MAX + 1
很容易,因为 INT_MAX
是 Mersenne Number。 (我们在这里不是在谈论 Mersenne Prime。)
代码利用:
梅森数 除以 2 的整数数学也是 梅森数.
整数类型的2次幂常量到浮点类型的转换可以确定精确。
#define DBL_INT_MAXP1 (2.0*(INT_MAX/2+1))
// Below needed when -INT_MAX == INT_MIN
#define DBL_INT_MINM1 (2.0*(INT_MIN/2-1))
形成精确的T::MIN - 1
很难,因为它的绝对值通常是2的幂+1,而且整数类型和FP类型的相对精度是不确定的。相反,代码可以减去 2 的精确幂并与 -1 进行比较。
int double_to_int(double x) {
if (x < DBL_INT_MAXP1) {
#if -INT_MAX == INT_MIN
// rare non-2's complement machine
if (x > DBL_INT_MINM1) {
return (int) x;
}
#else
if (x - INT_MIN > -1.0) {
return (int) x;
}
#endif
Handle_Underflow();
} else if (x > 0) {
Handle_Overflow();
} else {
Handle_NaN();
}
}
关于非二进制基数的浮点类型 (FLT_RADIX != 2
)
使用 FLT_RADIX = 4, 8, 16 ...
,转换也会很准确。使用 FLT_RADIX == 10
,代码至少精确到 34 位 int
,因为 double
必须精确编码 +/-10^10。所以说 FLT_RADIX == 10
、64 位 int
机器的问题 - 风险很低。根据记忆,最后一次 FLT_RADIX == 10
投入生产是在十多年前。
整数类型始终编码为 2 的补码(最常见)、1 的补码或符号大小。 INT_MAX
始终是 2 减 1 的幂。 INT_MIN
始终是 - 2 的幂或 1 以上。实际上,始终以 2 为基数。
此方法使用 C(不是 C++,请参阅第一条评论)标准中的浮点格式定义。知道有效数字的位数(由 numeric_limits::digits
提供)和指数限制(由 numeric_limits::max_exponent
提供)允许我们准备精确值作为端点。
我相信它可以在所有符合标准的 C++ 实现中工作,但要遵守最初评论中规定的适度附加要求。它支持带或不带无穷大的浮点格式,范围比目标整数格式更宽或更窄,以及任何舍入规则(因为它只使用具有完全可表示结果的浮点算法,所以永远不需要舍入)。
/* This code demonstrates safe conversion of floating-point to integer in
which the input floating-point value is converted to integer if and only if
it is in the supported domain for such conversions (the open interval
(Min-1, Max+1), where Min and Max are the mininum and maximum values
representable in the integer type). If the input is not in range, an error
throw and no conversion is performed. This throw can be replaced by any
desired error-indication mechanism so that all behavior is defined.
There are a few requirements not fully covered by the C++ standard. They
should be uncontroversial and supported by all reasonable C++
implementations:
The floating-point format is as described in C 2011 5.2.4.2.2 (modeled
by the product of a sign, a number of digits in some base b, and base b
raised to an exponent). I do not see this explicitly specified in the
C++ standard, but it is implied by the characteristics specified in
std::numeric_limits. (For example, C++ requires numeric_limits to
provide the number of base-b digits in the floating-point
representation, where b is the radix used, which means the
representation must have base-b digits.)
The following operations are exact in floating-point. (All of them
are elementary operations and have mathematical results that are
exactly representable, so there is no need for rounding, and hence
exact results are expected in any sane implementation.)
Dividing by the radix of the floating-point format, within its
range.
Multiplying by +1 or -1.
Adding or subtracting two values whose sum or difference is
representable.
std::numeric_limits<FPType>::min_exponent is not greater than
-std::numeric_limits<FPType>::digits. (The code can be modified to
eliminate this requirement.)
*/
#include <iostream> // Not needed except for demonstration.
#include <limits>
/* Define a class to support safe floating-point to integer conversions.
This sample code throws an exception when a source floating-point value is
not in the domain for which a correct integer result can be produced, but
the throw can be replaced with any desired code, such as returning an error
indication in an auxiliary object. (For example, one could return a pair
consisting of a success/error status and the destination value, if
successful.)
FPType is the source floating-point type.
IType is the destination integer type.
*/
template<typename FPType, typename IType> class FPToInteger
{
private:
/* Wrap the bounds we need in a static object so it can be easily
initialized just once for the entire program.
*/
static class StaticData
{
private:
/* This function helps us find the FPType values just inside the
interval (Min-1, Max+1), where Min and Max are the mininum and
maximum values representable in the integer type).
It returns the FPType of the same sign of x+s that has the greatest
magnitude less than x+s, where s is -1 or +1 according to whether x
is non-positive or positive.
*/
static FPType BiggestFPType(IType x)
{
/* All references to "digits" in this routine refer to digits in
base std::numeric_limits<FPType>::radix. For example, in base
3, 77 would have four digits (2212). Zero is considered to
have zero digits.
In this routine, "bigger" and "smaller" refer to magnitude. (3
is greater than -4, but -4 is bigger than 3.) */
// Abbreviate std::numeric_limits<FPType>::radix.
const int Radix = std::numeric_limits<FPType>::radix;
// Determine the sign.
int s = 0 < x ? +1 : -1;
// Count how many digits x has.
IType digits = 0;
for (IType t = x; t; ++digits)
t /= Radix;
/* If the FPType type cannot represent finite numbers this big,
return the biggest finite number it can hold, with the desired
sign.
*/
if (std::numeric_limits<FPType>::max_exponent < digits)
return s * std::numeric_limits<FPType>::max();
// Determine whether x is exactly representable in FPType.
if (std::numeric_limits<FPType>::digits < digits)
{
/* x is not representable, so we will return the next lower
representable value by removing just as many low digits as
necessary. Note that x+s might be representable, but we
want to return the biggest FPType less than it, which, in
this case, is also the biggest FPType less than x.
*/
/* Figure out how many digits we have to remove to leave at
most std::numeric_limits<FPType>::digits digits.
*/
digits = digits - std::numeric_limits<FPType>::digits;
// Calculate Radix to the power of digits.
IType t = 1;
while (digits--) t *= Radix;
return x / t * t;
}
else
{
/* x is representable. To return the biggest FPType smaller
than x+s, we will fill the remaining digits with Radix-1.
*/
// Figure out how many additional digits FPType can hold.
digits = std::numeric_limits<FPType>::digits - digits;
/* Put a 1 in the lowest available digit, then subtract from 1
to set each digit to Radix-1. (For example, 1 - .001 =
.999.)
*/
FPType t = 1;
while (digits--) t /= Radix;
t = 1-t;
// Return the biggest FPType smaller than x+s.
return x + s*t;
}
}
public:
/* These values will be initialized to the greatest FPType value less
than std::numeric_limits<IType>::max()+1 and the least FPType value
greater than std::numeric_limits<IType>::min()-1.
*/
const FPType UpperBound, LowerBound;
// Constructor to initialize supporting data for FPTypeToInteger.
StaticData()
: UpperBound(BiggestFPType(std::numeric_limits<IType>::max())),
LowerBound(BiggestFPType(std::numeric_limits<IType>::min()))
{
// Show values, just for illustration.
std::cout.precision(99);
std::cout << "UpperBound = " << UpperBound << ".\n";
std::cout << "LowerBound = " << LowerBound << ".\n";
}
} Data;
public:
FPType value;
// Constructor. Just remember the source value.
FPToInteger(FPType x) : value(x) {}
/* Perform the conversion. If the conversion is defined, return the
converted value. Otherwise, throw an exception.
*/
operator IType()
{
if (Data.LowerBound <= value && value <= Data.UpperBound)
return value;
else
throw "Error, source floating-point value is out of range.";
}
};
template<typename FPType, typename IType>
typename FPToInteger<FPType, IType>::StaticData
FPToInteger<FPType, IType>::Data;
typedef double FPType;
typedef int IType;
// Show what the class does with a requested value.
static void Test(FPType x)
{
try
{
IType y = FPToInteger<FPType, IType>(x);
std::cout << x << " -> " << y << ".\n";
}
catch (...)
{
std::cout << x << " is not in the domain.\n";
}
}
#include <cmath>
int main(void)
{
std::cout.precision(99);
// Simple demonstration (not robust testing).
Test(0);
Test(0x1p31);
Test(std::nexttoward(0x1p31, 0));
Test(-0x1p31-1);
Test(std::nexttoward(-0x1p31-1, 0));
}