Unexpected value counts using awk
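I have a text file named "test.txt" containing multiple lines, with fields separated by semicolons. I am trying to take the value of field3 > strip everything except the digits from the field > compare it with the value of field 3 on the previous line > if the value is unique, write the value of field 3, together with the difference between it and the previous value, to a file named "differences.txt".
So far I have the following code: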
awk -F';' '
BEGIN{d=0} {gsub(/^.*=/,"",$3);
if(d>0 && $3-d>0){print $3,$3-d} d=$3}
' test.txt > differences.txt
This works perfectly when I run it on the following text:
field1=xxx;field2=xxx;field3=111222222;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222222;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222333;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222444;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222555;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222555;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222777;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222888;field4=xxx;field5=xxx
Output, as expected:
111222333 111
111222444 111
111222555 111
111222777 222
111222888 111
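However, when I run it on the following input, I get completely different, unexpected numbers. I'm not sure whether this is due to the increased field length or something else?
Test input: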
test=none;test=20170606;test=1111111111111111111;
test=none;test=20170606;test=2222222222222222222;
test=none;test=20170606;test=3333333333333333333;
test=none;test=20170606;test=4444444444444444444;
test=none;test=20170606;test=5555555555555555555;
test=none;test=20170606;test=5555555555555555555;
test=none;test=20170606;test=6666666666666666666;
test=none;test=20170606;test=7777777777777777777;
test=none;test=20170606;test=8888888888888888888;
test=none;test=20170606;test=9999999999999999999;
test=none;test=20170606;test=100000000000000000000;
test=none;test=20170606;test=11111111111111111111;
Output, with unexpected values:
2222222222222222222 1111111111111111168
3333333333333333333 1111111111111111168
4444444444444444444 1111111111111111168
5555555555555555555 1111111111111110656
6666666666666666666 1111111111111111680
7777777777777777777 1111111111111110656
8888888888888888888 1111111111111111680
9999999999999999999 1111111111111110656
100000000000000000000 90000000000000000000
Can anyone see where I'm going wrong? I'm clearly missing something... it's driving me insane!!
Many thanks! :)
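The numbers in the second sample input are too large.
The logic of the program is correct, but precision is lost when calculating with very large integers. For example, 2222222222222222222 - 1111111111111111111 yields 1111111111111111168 instead of the expected 1111111111111111111.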
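To see the rounding directly, here is a minimal sketch that prints both operands as awk actually stores them. It assumes an awk that keeps numbers as IEEE 64-bit doubles (the common case), and show_rounding is only an illustrative name, not part of the original script.

```shell
# Sketch: show how awk stores the two operands and their difference.
# Assumes numbers are held as IEEE 64-bit doubles.
show_rounding() {
    awk 'BEGIN {
        a = 2222222222222222222
        b = 1111111111111111111
        # %.0f prints the exact integer value of the stored double.
        printf "%.0f\n", a        # a as stored (already rounded)
        printf "%.0f\n", b        # b as stored (already rounded)
        printf "%.0f\n", a - b    # difference of the stored values
    }'
}

show_rounding
```

On such a system neither of the first two lines should match the literal that was typed in, and the last line should be the 1111111111111111168 seen in the question's output. The subtraction itself is fine; the inputs were already rounded when they were read.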
As has been mentioned already, awk uses hardware double precision with 64-bit IEEE binary floating-point representation for numbers on most systems. A large integer like 9,007,199,254,740,997 has a binary representation that, although finite, is more than 53 bits long; it must also be rounded to 53 bits. The biggest integer that can be stored in a C double is usually the same as the largest possible value of a double. If your system double is an IEEE 64-bit double, this largest possible value is an integer and can be represented precisely. What more should one know about integers?
If you want to know what is the largest integer, such that it and all smaller integers can be stored in 64-bit doubles without losing precision, then the answer is 2^53. The next representable number is the even number 2^53 + 2, meaning it is unlikely that you will be able to make gawk print 2^53 + 1 in integer format. The range of integers exactly representable by a 64-bit double is [-2^53, 2^53]. If you ever see an integer outside this range in awk using 64-bit doubles, you have reason to be very suspicious about the accuracy of the output.
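The 2^53 boundary described above can be checked with a short sketch. It again assumes an awk using IEEE 64-bit doubles, and boundary_demo is a hypothetical name chosen for illustration.

```shell
# Sketch: probe the 2^53 limit of exactly representable integers.
# Assumes awk numbers are IEEE 64-bit doubles.
boundary_demo() {
    awk 'BEGIN {
        limit = 2^53
        printf "%.0f\n", limit        # 2^53 itself is exact
        printf "%.0f\n", limit + 1    # not representable, gets rounded
        printf "%.0f\n", limit + 2    # the next representable integer
        if (limit + 1 == limit)
            print "2^53 + 1 compares equal to 2^53"
        else
            print "2^53 + 1 is distinct here"
    }'
}

boundary_demo
```

If the first two lines printed are identical, the awk in use cannot tell 2^53 + 1 apart from 2^53, which is the situation the quoted passage warns about.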
As @EdMorton pointed out in a comment, if your Awk was compiled with MPFR support and you specify the -M flag, you can perform arbitrary-precision arithmetic.
For details, see 15.3 Arbitrary-Precision Arithmetic Features.
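As a rough sketch of what that could look like, the following runs the question's script under gawk -M on two of the large sample lines. It assumes gawk is installed and was built with MPFR/GMP support. The helper names gawk_has_bignum and diff_exact, and the temporary sample file, are made up for this illustration.

```shell
# Sketch: the question's script under gawk -M (arbitrary-precision mode).
# Assumes gawk built with MPFR/GMP; -M is gawk-specific, not POSIX awk.

# Hypothetical helper: succeed only if this gawk really honours -M.
gawk_has_bignum() {
    command -v gawk >/dev/null 2>&1 &&
    [ "$(gawk -M 'BEGIN { print 2^53 + 1 }' 2>/dev/null)" = "9007199254740993" ]
}

# Hypothetical helper: print field 3 and its increase over the previous line.
diff_exact() {
    gawk -M -F';' '
    {
        gsub(/^.*=/, "", $3)          # drop everything up to the "="
        if (d > 0 && $3 - d > 0)      # only report values that increased
            print $3, $3 - d
        d = $3
    }' "$1"
}

# Small self-contained demo using two lines from the question's input.
sample=$(mktemp)
printf '%s\n' \
    'test=none;test=20170606;test=1111111111111111111;' \
    'test=none;test=20170606;test=2222222222222222222;' > "$sample"

if gawk_has_bignum; then
    diff_exact "$sample"
else
    echo "gawk with working -M support not found; skipping demo" >&2
fi
rm -f "$sample"
```

With a gawk that supports -M, the difference should come out as the exact 1111111111111111111 rather than the rounded value. Without that support the flag has no effect, which is why the sketch checks for it first.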