Why does restricting GC to 1 thread increase performance?

Question

I wrote some simple Java code to artificially use a lot of RAM, and I measured the following run times with these flags:

1029.59 seconds .... -Xmx8g -Xms256m
696.44 seconds ..... -XX:ParallelGCThreads=1  -Xmx8g -Xms256m
247.27 seconds ..... -XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC  -Xmx8g -Xms256m

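Now, I understand why -XX:+UseConcMarkSweepGC improves performance, but why do I get a speedup when I restrict GC to a single thread? Is this an artifact of my poorly written Java code, or would it also apply to properly optimized Java?

Here is my code: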

import java.io.*;

class xdriver {
  static int N = 100;
  static double pi = 3.141592653589793;
  static double one = 1.0;
  static double two = 2.0;

  public static void main(String[] args) {
    //System.out.println("Program has started successfully\n");

    if( args.length == 1) {
      // assume that args[0] is an integer
      N = Integer.parseInt(args[0]);
    }   

    // maybe we can get user input later on this ...
    int nr = N;
    int nt = N;
    int np = 2*N;

    double dr = 1.0/(double)(nr-1);
    double dt = pi/(double)(nt-1);
    double dp = (two*pi)/(double)(np-1);

    System.out.format("nn --> %d\n", nr*nt*np);

    if(nr*nt*np < 0) {
      System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n", (long)( (long)nr*(long)nt*(long)np), nr*nt*np);
      System.exit(1);
    }   

    // inserted to artificially blow up RAM
    double[][] dels = new double [nr*nt*np][3];

    double[] rs = new double[nr];
    double[] ts = new double[nt];
    double[] ps = new double[np];

    for(int ir = 0; ir < nr; ir++) {
      rs[ir] = dr*(double)(ir);
    }   
    for(int it = 0; it < nt; it++) {
      ts[it] = dt*(double)(it);
    }   
    for(int ip = 0; ip < np; ip++) {
      ps[ip] = dp*(double)(ip);
    }   

    double C = (4.0/3.0)*pi;
    C = one/C;

    double fint = 0.0;
    int ii = 0;
    for(int ir = 0; ir < nr; ir++) {
      double r = rs[ir];
      double r2dr = r*r*dr;
      for(int it = 0; it < nt; it++) {
        double t = ts[it];
        double sint = Math.sin(t);
        for(int ip = 0; ip < np; ip++) {
          fint += C*r2dr*sint*dt*dp;

          dels[ii][0] = dr; 
          dels[ii][1] = dt; 
          dels[ii][2] = dp; 
        }   
      }   
    }   

    System.out.format("N ........ %d\n", N);
    System.out.format("fint ..... %15.10f\n", fint);
    System.out.format("err ...... %15.10f\n", Math.abs(1.0-fint));
  }
}

Answer 1

https://community.oracle.com/thread/2191327

ParallelGCThreads sets the number of threads, and possibly cores, the GC will use.

If you set this to 8 it can speed up your GC time. However, it could mean all your other applications have to stop or will be competing with these threads.

It may be undesirable to have all your applications stop or slow down when any JVM wants to GC.

As such, a setting of 2 may be your best choice. You might find 3 or 4 is fine for your usage pattern (if your JVMs are typically idle); otherwise I suggest sticking with 2.

Answer 2

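I'm not an expert on garbage collectors, so this may not be the answer you were hoping for, but perhaps my findings on your problem are of interest.

First, I turned your code into a JUnit test case. Then I added the JUnitBenchmarks extension from Carrot Search Labs. It runs a JUnit test case several times, measures the run time, and outputs some performance statistics. Most importantly, JUnitBenchmarks does a 'warmup', i.e. it runs the code several times before actually taking measurements.

The code I ended up running: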

import com.carrotsearch.junitbenchmarks.AbstractBenchmark;
import com.carrotsearch.junitbenchmarks.BenchmarkOptions;
import com.carrotsearch.junitbenchmarks.annotation.BenchmarkHistoryChart;
import com.carrotsearch.junitbenchmarks.annotation.LabelType;

@BenchmarkOptions(benchmarkRounds = 10, warmupRounds = 5)
@BenchmarkHistoryChart(labelWith = LabelType.CUSTOM_KEY, maxRuns = 20)
public class XDriverTest extends AbstractBenchmark {
    static int N = 200;
    static double pi = 3.141592653589793;
    static double one = 1.0;
    static double two = 2.0;

    @org.junit.Test
    public void test() {
        // System.out.println("Program has started successfully\n");
        // maybe we can get user input later on this ...
        int nr = N;
        int nt = N;
        int np = 2 * N;

        double dr = 1.0 / (double) (nr - 1);
        double dt = pi / (double) (nt - 1);
        double dp = (two * pi) / (double) (np - 1);

        System.out.format("nn --> %d\n", nr * nt * np);

        if (nr * nt * np < 0) {
            System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n",
                    (long) ((long) nr * (long) nt * (long) np), nr * nt * np);
            System.exit(1);
        }

        // inserted to artificially blow up RAM
        double[][] dels = new double[nr * nt * np][3];

        double[] rs = new double[nr];
        double[] ts = new double[nt];
        double[] ps = new double[np];

        for (int ir = 0; ir < nr; ir++) {
            rs[ir] = dr * (double) (ir);
        }
        for (int it = 0; it < nt; it++) {
            ts[it] = dt * (double) (it);
        }
        for (int ip = 0; ip < np; ip++) {
            ps[ip] = dp * (double) (ip);
        }

        double C = (4.0 / 3.0) * pi;
        C = one / C;

        double fint = 0.0;
        int ii = 0;
        for (int ir = 0; ir < nr; ir++) {
            double r = rs[ir];
            double r2dr = r * r * dr;
            for (int it = 0; it < nt; it++) {
                double t = ts[it];
                double sint = Math.sin(t);
                for (int ip = 0; ip < np; ip++) {
                    fint += C * r2dr * sint * dt * dp;

                    dels[ii][0] = dr;
                    dels[ii][1] = dt;
                    dels[ii][2] = dp;
                }
            }
        }

        System.out.format("N ........ %d\n", N);
        System.out.format("fint ..... %15.10f\n", fint);
        System.out.format("err ...... %15.10f\n", Math.abs(1.0 - fint));
    }
}

As you can see from the benchmark options @BenchmarkOptions(benchmarkRounds = 10, warmupRounds = 5), the warmup consists of 5 runs of the test method, after which the actual benchmark runs it 10 times.
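For readers without JUnitBenchmarks, the warmup-then-measure idea can be sketched in plain Java. This is only an illustration of the principle, not how JUnitBenchmarks is implemented; the class WarmupTimer, its methods, and the allocate workload are hypothetical names made up for this sketch.

```java
import java.util.function.DoubleSupplier;

class WarmupTimer {
    /**
     * Runs the workload warmupRounds times without timing it, then returns
     * the mean wall-clock time in seconds over benchmarkRounds timed runs.
     */
    public static double averageSeconds(DoubleSupplier workload,
                                        int warmupRounds, int benchmarkRounds) {
        double sink = 0.0; // accumulate results so the JIT cannot discard the work
        for (int i = 0; i < warmupRounds; i++) {
            sink += workload.getAsDouble();
        }
        long totalNanos = 0L;
        for (int i = 0; i < benchmarkRounds; i++) {
            long start = System.nanoTime();
            sink += workload.getAsDouble();
            totalNanos += System.nanoTime() - start;
        }
        if (Double.isNaN(sink)) {
            System.out.println("unexpected NaN from workload");
        }
        return totalNanos / 1e9 / benchmarkRounds;
    }

    /** Hypothetical stand-in for the test method: allocates n small arrays. */
    public static double allocate(int n) {
        double[][] dels = new double[n][3];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            dels[i][0] = i;
            sum += dels[i][0];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Same round counts as in the @BenchmarkOptions annotation.
        double avg = averageSeconds(() -> allocate(1_000_000), 5, 10);
        System.out.format("average round time: %.3f s%n", avg);
    }
}
```

A hand-rolled loop like this lacks the statistics and history charts that JUnitBenchmarks provides, so treat it only as a quick sanity check.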

I then ran the program above with several different GC options (each with the general heap settings -Xmx1g -Xms256m):

default (no special options)
-XX:ParallelGCThreads=1 -Xmx1g -Xms256m
-XX:ParallelGCThreads=2 -Xmx1g -Xms256m
-XX:ParallelGCThreads=4 -Xmx1g -Xms256m
-XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
-XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
-XX:ParallelGCThreads=2 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
-XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m

To get a summary HTML page with a chart, the following VM arguments were passed in addition to the GC settings above:

-Djub.consumers=CONSOLE,H2 -Djub.db.file=.benchmarks
-Djub.customkey=[CUSTOM_KEY]

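(where [CUSTOM_KEY] must be a string that uniquely identifies each benchmark run, e.g. defaultGC or ParallelGCThreads=1. It is used as the label on the chart axis).

The following table summarizes the results: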

Run Custom key          Timestamp                   test
1   defaultGC           2015-05-01 19:43:53.796     10.721
2   ParallelGCThreads=1 2015-05-01 19:51:07.79       8.770
3   ParallelGCThreads=2 2015-05-01 19:56:44.985      8.737
4   ParallelGCThreads=4 2015-05-01 20:01:30.071     10.415
5   UseConcMarkSweepGC  2015-05-01 20:03:54.474      2.683
6   UseCCMS,Threads=1   2015-05-01 20:10:48.504      3.856
7   UseCCMS,Threads=2   2015-05-01 20:12:58.624      3.861
8   UseCCMS,Threads=4   2015-05-01 20:13:58.94       2.701

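System info: CPU: Intel Core 2 Quad Q9400, 2.66 GHz, RAM: 4.00 GB, OS: Windows 8.1 x64, JVM: 1.8.0_05-b13.

(Note that the individual benchmark runs output more detailed information, such as standard deviation, GC calls and GC time; unfortunately, this information is not available in the summary.)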
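GC call counts and GC time can also be read directly from the JVM through the standard java.lang.management API. The following is a minimal sketch, assuming a workload similar in spirit to the test above; the class GcStats and its helper methods are hypothetical names, and the numbers it prints depend on the collector and heap settings in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStats {
    /** Sum of collection counts over all collectors (undefined values are skipped). */
    public static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Sum of approximate accumulated collection times in milliseconds. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalGcCount();
        long millisBefore = totalGcMillis();

        // Hypothetical workload: many small arrays, as in the code under test.
        double[][] dels = new double[2_000_000][3];
        dels[0][0] = 1.0;

        System.out.format("GC calls: %d, GC time: %d ms%n",
                totalGcCount() - countBefore, totalGcMillis() - millisBefore);
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.format("  %s: %d calls, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Running this once per GC configuration would give a rough per-collector breakdown to set beside the wall-clock times in the table.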

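Interpretation

As you can see, there is a huge performance gain when -XX:+UseConcMarkSweepGC is enabled. The number of threads does not affect performance that much, and whether more threads help depends on the general GC strategy. The default GC seems to benefit from being limited to one or two threads, and gets worse again with four threads.

In contrast, the ConcurrentMarkSweep GC with four threads performs better than with one or two threads.

So in general, we cannot say that more GC threads make performance worse.

Note that I don't know how many GC threads are used by the default GC or the ConcurrentMarkSweep GC when the number of threads is not specified.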
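One way to find out how many GC threads a HotSpot JVM picked is to read the ParallelGCThreads flag at run time. The sketch below assumes a HotSpot-based JVM (it relies on com.sun.management.HotSpotDiagnosticMXBean, which other JVMs may not provide); the class GcThreadInfo and the helper flagValue are hypothetical names.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcThreadInfo {
    /** Returns the current value of a HotSpot VM flag as a string. */
    public static String flagValue(String name) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return hotspot.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("available processors: "
                + Runtime.getRuntime().availableProcessors());
        System.out.println("ParallelGCThreads:    " + flagValue("ParallelGCThreads"));
        // The collector names show which GC the JVM selected by default.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("collector:            " + gc.getName());
        }
    }
}
```

The same value can be read from the command line with java -XX:+PrintFlagsFinal -version, which lists ParallelGCThreads among the final flag values.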


java

multithreading

garbage-collection