StormCrawler throws Halting due to Out Of Memory Error
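I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. My crawler configuration is below. I am crawling a website with millions of documents. If I run a domain-specific crawl by applying fast.urlfilter.json, the crawler does not give me any errors. When I point it at the main domain by applying "ignoreOutsideHost": false, "ignoreOutsideDomain": true, it throws java.lang.OutOfMemoryError: Java heap space and Halting due to Out Of Memory Error...FetcherThread #0. Is there any way to crawl smoothly without memory errors? The crawler configuration is linked from the post, and the detailed logs follow.
Thanks in advance, and apologies for the huge post.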
worker.log:
2019-01-22 08:31:51.989 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://arts.test.edu/login/?next=/schools/film-animation/other-school-film-and-animation-festivals-and-awards/test-film-and-animation-awards-1998 with status 200 in msec 107
2019-01-22 08:31:56.815 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://portfolios.test.edu/search?tags=Othello with status 200 in msec 162
2019-01-22 08:32:46.572 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://spiff.test.edu/richmond/testobs/jul25_2013/?C=S;O=A with status 200 in msec 3
2019-01-22 08:32:01.862 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://campusgroups.test.edu/slu/members/ with status 200 in msec 229
2019-01-22 08:32:06.693 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://arts.test.edu/news/16 with status 200 in msec 119
2019-01-22 08:32:11.601 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] Crawl delay for queue: www.apply.test.edu is set to 10000 as per robots.txt. url: https://www.apply.test.edu/news/testapply-holds-student-research-fair
2019-01-22 08:32:13.765 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://www.apply.test.edu/news/testapply-holds-student-research-fair with status 200 in msec 2164
2019-01-22 08:32:16.616 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://apps.test.edu/cos/scms/equipment/schedules.php?id=25&date=9-21-2019 with status 200 in msec 46
2019-01-22 08:32:21.780 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://edge.test.edu/edge/P19319/public/FILENAME.docx with status 200 in msec 156
2019-01-22 08:32:27.837 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://applywebdev.test.edu/news/booth-biography-selected-national-reading-project?page=6 with status 200 in msec 1231
2019-01-22 08:32:30.075 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://applywebdev.test.edu/news/grant-improve-problem-solving-skills-deaf-and-hard-hearing-students?page=6 with status 200 in msec 1235
2019-01-22 08:32:31.775 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://portfolios.test.edu/search?tags=feedback with status 200 in msec 197
2019-01-22 08:32:36.582 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] Crawl delay for queue: infoguides.test.edu is set to 10000 as per robots.txt. url: http://infoguides.test.edu/c.php?g=357360&p=4416876
2019-01-22 08:32:36.693 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://infoguides.test.edu/c.php?g=357360&p=4416876 with status 200 in msec 111
2019-01-22 08:32:41.602 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] Crawl delay for queue: www.sic.test.edu is set to 10000 as per robots.txt. url: https://www.sic.test.edu/news/sic-undergraduate-research-sparks-prestigious-professorship-astronomy?page=10
2019-01-22 08:32:42.455 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://www.sic.test.edu/news/sic-undergraduate-research-sparks-prestigious-professorship-astronomy?page=10 with status 200 in msec 853
2019-01-22 08:32:46.572 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://spiff.test.edu/richmond/testobs/jul25_2013/?C=S;O=A with status 200 in msec 3
2019-01-22 08:32:51.595 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] Crawl delay for queue: www.apply.test.edu is set to 10000 as per robots.txt. url: https://www.apply.test.edu/news/testapply-students-graduate-accolades
2019-01-22 08:32:53.748 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://www.apply.test.edu/news/testapply-students-graduate-accolades with status 200 in msec 2152
2019-01-22 08:33:01.976 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://inside.test.edu/?date=2023-12-1&t=list with status 200 in msec 355
2019-01-22 08:33:11.957 STDIO FetcherThread #0 [ERROR] Halting due to Out Of Memory Error...FetcherThread #0
2019-01-22 08:33:11.960 STDERR Thread-2 [INFO] java.lang.OutOfMemoryError: Java heap space
2019-01-22 08:33:11.968 STDERR Thread-2 [INFO] Dumping heap to artifacts/heapdump ...
2019-01-22 08:33:11.968 STDERR Thread-2 [INFO] Unable to create artifacts/heapdump: File exists
supervisor.log:
2019-01-22 08:31:40.341 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Created Worker ID da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:40.341 o.a.s.d.s.Container SLOT_6700 [INFO] Setting up 164ddb0a-fcba-41e3-9a14-386248370bcf:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:40.341 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:40.341 o.a.s.d.s.Container SLOT_6700 [INFO] SET worker-user da2944c7-cfd2-409a-856b-84f0a0014f56 testweb
2019-01-22 08:31:40.342 o.a.s.d.s.Container SLOT_6700 [INFO] Creating symlinks for worker-id: da2944c7-cfd2-409a-856b-84f0a0014f56 storm-id: www-staging-crawler-4-1548106042 for files(1): [resources]
2019-01-22 08:31:40.342 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Launching worker with assignment LocalAssignment(topology_id:www-staging-crawler-4-1548106042, executors:[ExecutorInfo(task_start:8, task_end:8), ExecutorInfo(task_start:2, task_end:2), ExecutorInfo(task_start:6, task_end:6), ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:9, task_end:9), ExecutorInfo(task_start:5, task_end:5)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0), owner:testweb) for this supervisor 164ddb0a-fcba-41e3-9a14-386248370bcf on port 6700 with id da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:40.342 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Launching worker with command: 'java' '-cp' '/home/testweb/apps/crawler/apache-storm-1.2.2/lib/*:/home/testweb/apps/crawler/apache-storm-1.2.2/extlib/*:/home/testweb/crawler/apache-storm-1.2.2/conf:/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/supervisor/stormdist/www-staging-crawler-4-1548106042/stormjar.jar' '-Xmx64m' '-Dlogging.sensitivity=S3' '-Dlogfile.name=worker.log' '-Dstorm.home=/home/testweb/apps/crawler/apache-storm-1.2.2' '-Dworkers.artifacts=/home/testweb/var/logs/workers-artifacts' '-Dstorm.id=www-staging-crawler-4-1548106042' '-Dworker.id=da2944c7-cfd2-409a-856b-84f0a0014f56' '-Dworker.port=6700' '-Dstorm.log.dir=/home/testweb/var/logs' '-Dlog4j.configurationFile=/home/testweb/apps/crawler/apache-storm-1.2.2/log4j2/worker.xml' '-DLog4jContextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector' '-Dstorm.local.dir=storm-local' 'org.apache.storm.LogWtester' 'java' '-server' '-Dlogging.sensitivity=S3' '-Dlogfile.name=worker.log' '-Dstorm.home=/home/testweb/apps/crawler/apache-storm-1.2.2' '-Dworkers.artifacts=/home/testweb/var/logs/workers-artifacts' '-Dstorm.id=www-staging-crawler-4-1548106042' '-Dworker.id=da2944c7-cfd2-409a-856b-84f0a0014f56' '-Dworker.port=6700' '-Dstorm.log.dir=/home/testweb/var/logs' '-Dlog4j.configurationFile=/home/testweb/apps/crawler/apache-storm-1.2.2/log4j2/worker.xml' '-DLog4jContextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector' '-Dstorm.local.dir=storm-local' '-Xmx2048m' '-XX:+PrintGCDetails' '-Xloggc:artifacts/gc.log' '-XX:+PrintGCDateStamps' '-XX:+PrintGCTimeStamps' '-XX:+UseGCLogFileRotation' '-XX:NumberOfGCLogFiles=10' '-XX:GCLogFileSize=1M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:HeapDumpPath=artifacts/heapdump' 
'-Djava.library.path=/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/supervisor/stormdist/www-staging-crawler-4-1548106042/resources/Linux-amd64:/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/supervisor/stormdist/www-staging-crawler-4-1548106042/resources:/usr/local/lib:/opt/local/lib:/usr/lib' '-Dstorm.conf.file=' '-Dstorm.options=' '-Djava.io.tmpdir=/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/tmp' '-cp' '/home/testweb/apps/crawler/apache-storm-1.2.2/lib/*:/home/testweb/apps/crawler/apache-storm-1.2.2/extlib/*:/home/testweb/crawler/apache-storm-1.2.2/conf:/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/supervisor/stormdist/www-staging-crawler-4-1548106042/stormjar.jar' 'org.apache.storm.daemon.worker' 'www-staging-crawler-4-1548106042' '164ddb0a-fcba-41e3-9a14-386248370bcf' '6700' 'da2944c7-cfd2-409a-856b-84f0a0014f56'.
2019-01-22 08:31:40.344 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL_AND_RELAUNCH msInState: 18 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56 -> WAITING_FOR_WORKER_START msInState: 0 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:45.350 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_WORKER_START msInState: 5006 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56 -> RUNNING msInState: 0 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:12.328 o.a.s.d.s.BasicContainer Thread-2505 [INFO] Worker Process da2944c7-cfd2-409a-856b-84f0a0014f56 exited with code: 255
2019-01-22 08:33:12.370 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700: main process has exited
2019-01-22 08:33:12.370 o.a.s.d.s.Container SLOT_6700 [INFO] Killing 164ddb0a-fcba-41e3-9a14-386248370bcf:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:12.380 o.a.s.u.Utils SLOT_6700 [INFO] Error when trying to kill 1554. Process is probably already dead.
2019-01-22 08:33:15.380 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 90030 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56 -> KILL_AND_RELAUNCH msInState: 0 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.381 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.394 o.a.s.d.s.Container SLOT_6700 [INFO] Cleaning up 164ddb0a-fcba-41e3-9a14-386248370bcf:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.395 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.395 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/pids/1554
2019-01-22 08:33:15.395 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/heartbeats
2019-01-22 08:33:15.399 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/pids
2019-01-22 08:33:15.399 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/tmp
2019-01-22 08:33:15.400 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.400 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.400 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers-users/da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.400 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID da2944c7-cfd2-409a-856b-84f0a0014f56
gc.log.0.current:
Java HotSpot(TM) 64-Bit Server VM (25.191-b26) for linux-amd64 JRE (1.8.0_191-b26), built on Oct 8 2018 13:54:08 by "java_re" with gcc 7.3.0
Memory: 4k page, physical 8168328k(1737328k free), swap 8387580k(8386288k free)
CommandLine flags: -XX:GCLogFileSize=1048576 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=artifacts/heapdump -XX:InitialHeapSize=130693248 -XX:MaxHeapSize=2147483648 -XX:NumberOfGCLogFiles=10 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseGCLogFileRotation -XX:+UseParallelGC
2019-01-22T08:31:41.541-0500: 1.028: [GC (Allocation Failure) [PSYoungGen: 32768K->5096K(37888K)] 32768K->6882K(123904K), 0.0098372 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2019-01-22T08:31:42.155-0500: 1.642: [GC (Allocation Failure) [PSYoungGen: 37864K->5110K(37888K)] 39650K->10524K(123904K), 0.0104951 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2019-01-22T08:31:42.557-0500: 2.044: [GC (Metadata GC Threshold) [PSYoungGen: 24280K->5094K(37888K)] 29694K->12912K(123904K), 0.0129743 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2019-01-22T08:31:42.570-0500: 2.057: [Full GC (Metadata GC Threshold) [PSYoungGen: 5094K->0K(37888K)] [ParOldGen: 7817K->7345K(64000K)] 12912K->7345K(101888K), [Metaspace: 21023K->21023K(1067008K)], 0.0578299 secs] [Times: user=0.13 sys=0.01, real=0.06 secs]
2019-01-22T08:31:42.858-0500: 2.344: [GC (Allocation Failure) [PSYoungGen: 32768K->2425K(48128K)] 40113K->9771K(112128K), 0.0039971 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
2019-01-22T08:31:43.563-0500: 3.050: [GC (Allocation Failure) [PSYoungGen: 47993K->5099K(68096K)] 55339K->15796K(132096K), 0.0183739 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
2019-01-22T08:31:44.248-0500: 3.735: [GC (Metadata GC Threshold) [PSYoungGen: 45605K->9669K(74752K)] 56303K->20375K(138752K), 0.0171562 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
2019-01-22T08:31:44.266-0500: 3.752: [Full GC (Metadata GC Threshold) [PSYoungGen: 9669K->0K(74752K)] [ParOldGen: 10705K->14480K(108032K)] 20375K->14480K(182784K), [Metaspace: 34870K->34870K(1079296K)], 0.1069368 secs] [Times: user=0.36 sys=0.01, real=0.11 secs]
2019-01-22T08:31:45.775-0500: 5.261: [GC (GCLocker Initiated GC) [PSYoungGen: 63488K->8826K(75776K)] 77975K->23321K(183808K), 0.0103824 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2019-01-22T08:31:46.619-0500: 6.106: [GC (Allocation Failure) [PSYoungGen: 72314K->12264K(90624K)] 86844K->30380K(198656K), 0.0228691 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
2019-01-22T08:31:47.414-0500: 6.901: [GC (Allocation Failure) [PSYoungGen: 90600K->15337K(93696K)] 108716K->33992K(201728K), 0.0215458 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
2019-01-22T08:31:47.499-0500: 6.986: [GC (Allocation Failure) [PSYoungGen: 93636K->14043K(110080K)] 112291K->32707K(218112K), 0.0191082 secs] [Times: user=0.03 sys=0.01, real=0.02 secs]
2019-01-22T08:31:47.565-0500: 7.052: [GC (Allocation Failure) [PSYoungGen: 106715K->13585K(111104K)] 125379K->32256K(219136K), 0.0110566 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2019-01-22T08:31:47.975-0500: 7.461: [GC (Allocation Failure) [PSYoungGen: 106257K->9626K(148480K)] 124928K->37589K(256512K), 0.0329521 secs] [Times: user=0.07 sys=0.02, real=0.03 secs]
2019-01-22T08:31:48.847-0500: 8.334: [GC (Metadata GC Threshold) [PSYoungGen: 120769K->5799K(149504K)] 148732K->123739K(344576K), 0.0346237 secs] [Times: user=0.07 sys=0.02, real=0.04 secs]
2019-01-22T08:31:48.882-0500: 8.369: [Full GC (Metadata GC Threshold) [PSYoungGen: 5799K->0K(149504K)] [ParOldGen: 117940K->115617K(263680K)] 123739K->115617K(413184K), [Metaspace: 57889K->57857K(1099776K)], 0.2179918 secs] [Times: user=0.66 sys=0.01, real=0.21 secs]
2019-01-22T08:31:56.805-0500: 16.291: [GC (Allocation Failure) [PSYoungGen: 131072K->4807K(189440K)] 246689K->120432K(453120K), 0.0092119 secs] [Times: user=0.03 sys=0.01, real=0.01 secs]
2019-01-22T08:32:11.898-0500: 31.385: [GC (Allocation Failure) [PSYoungGen: 181447K->1713K(195072K)] 297072K->120453K(458752K), 0.0062305 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2019-01-22T08:32:26.904-0500: 46.391: [GC (Allocation Failure) [PSYoungGen: 178353K->981K(234496K)] 297093K->120609K(498176K), 0.0048011 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2019-01-22T08:32:47.815-0500: 67.302: [GC (Allocation Failure) [PSYoungGen: 223701K->1518K(241664K)] 343329K->121154K(505344K), 0.0102639 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2019-01-22T08:33:07.716-0500: 87.203: [GC (Allocation Failure) [PSYoungGen: 194483K->1385K(262144K)] 314119K->121029K(525824K), 0.0059916 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2019-01-22T08:33:11.599-0500: 91.086: [GC (Allocation Failure) [PSYoungGen: 127845K->1390K(268288K)] 247489K->140704K(1666560K), 0.0107712 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2019-01-22T08:33:11.610-0500: 91.097: [GC (Allocation Failure) [PSYoungGen: 1390K->1401K(294400K)] 140704K->140715K(1692672K), 0.0037587 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
2019-01-22T08:33:11.614-0500: 91.100: [Full GC (Allocation Failure) [PSYoungGen: 1401K->0K(294400K)] [ParOldGen: 139314K->51057K(201728K)] 140715K->51057K(496128K), [Metaspace: 60831K->60827K(1101824K)], 0.0966803 secs] [Times: user=0.24 sys=0.01, real=0.09 secs]
2019-01-22T08:33:11.712-0500: 91.199: [GC (Allocation Failure) [PSYoungGen: 0K->0K(293888K)] 51057K->51057K(1692160K), 0.0100144 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2019-01-22T08:33:11.723-0500: 91.209: [Full GC (Allocation Failure) [PSYoungGen: 0K->0K(293888K)] [ParOldGen: 51057K->48333K(224768K)] 51057K->48333K(518656K), [Metaspace: 60827K->60134K(1101824K)], 0.2302426 secs] [Times: user=0.67 sys=0.01, real=0.23 secs]
Heap
PSYoungGen total 293888K, used 1071K [0x00000000d5580000, 0x00000000ee180000, 0x0000000100000000)
eden space 275968K, 0% used [0x00000000d5580000,0x00000000d568bfb8,0x00000000e6300000)
from space 17920K, 0% used [0x00000000e6300000,0x00000000e6300000,0x00000000e7480000)
to space 17408K, 0% used [0x00000000ed080000,0x00000000ed080000,0x00000000ee180000)
ParOldGen total 1398272K, used 48333K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
object space 1398272K, 3% used [0x0000000080000000,0x0000000082f335b0,0x00000000d5580000)
Metaspace used 60138K, capacity 60994K, committed 62464K, reserved 1101824K
class space used 9379K, capacity 9681K, committed 9984K, reserved 1048576K
worker.log.err
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
Heap dump file created [965011634 bytes in 9.400 secs]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
Unable to create artifacts/heapdump: File exists
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
Unable to create artifacts/heapdump: File exists
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
.
robots.txt
User-agent: *
Crawl-delay: 10
# Directories
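Have you tried analyzing the heap dump with JHat or VisualVM?
Update: the heap dump above shows that the memory is filled with content held by the fetcher threads. The fact that you do not get the error when you reduce the content limit would confirm this. Use more memory if you can, or keep limiting the maximum content length. You could also run with fewer threads in parallel.
Note: if you hit an endless stream, such as a radio or video stream, the default http implementation will keep loading content regardless of the limit that is set. The okhttp implementation is more reliable in that respect.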
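To illustrate why the limit has to be enforced while the body is being read, here is a minimal Python sketch of a capped read. It is not StormCrawler's or okhttp's code: `read_capped`, `EndlessStream`, and the chunk size are hypothetical names and values chosen for the illustration. The reader stops pulling from the stream once the cap is reached, so even a source that never ends cannot grow the buffer past the limit.

```python
import io


def read_capped(stream, limit, chunk_size=8192):
    """Read at most `limit` bytes from a binary stream.

    Returns (data, truncated). A negative limit means "no cap",
    which never returns if the stream never ends.
    """
    buf = bytearray()
    while True:
        if limit >= 0:
            remaining = limit - len(buf)
            if remaining <= 0:
                # Cap reached: flag truncation only if more data is waiting.
                return bytes(buf), stream.read(1) != b""
            want = min(chunk_size, remaining)
        else:
            want = chunk_size
        chunk = stream.read(want)
        if not chunk:
            # End of stream reached before the cap.
            return bytes(buf), False
        buf.extend(chunk)


class EndlessStream:
    """Hypothetical stand-in for a radio or video stream that never ends."""

    def read(self, size):
        return b"\x00" * size


data, truncated = read_capped(EndlessStream(), limit=5_000_000)
print(len(data), truncated)
```

With a negative limit the same loop would keep appending until the heap is exhausted, which is the failure mode described in the note.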
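UPDATE: Maybe it is http.content.limit? We had it set to -1 because our extractor was not retrieving the whole page (one of our sites has a very large menu at the top of its pages). Turning the limit off completely seems to have been a mistake. We have now set http.content.limit: 5000000 (5MB) and let it run. So far no errors...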
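As a rough sanity check on that setting, the following Python sketch estimates the worst-case amount of page content held in memory if every fetcher thread buffers a full-size response at the same time. The heap size is taken from the `-Xmx2048m` flag in the supervisor log and the limit from the value above. The thread counts are hypothetical inputs, and the estimate ignores everything else on the heap (parsing, queues, Storm itself).

```python
HEAP_BYTES = 2048 * 1024 * 1024   # -Xmx2048m from the worker command line
CONTENT_LIMIT = 5_000_000         # http.content.limit from the update above


def worst_case_buffered(threads, content_limit=CONTENT_LIMIT):
    """Upper bound on bytes of fetched content held by `threads` fetcher threads."""
    return threads * content_limit


def heap_fraction(threads, content_limit=CONTENT_LIMIT, heap=HEAP_BYTES):
    """Share of the maximum heap that the worst-case buffered content would use."""
    return worst_case_buffered(threads, content_limit) / heap


for threads in (10, 50, 100):     # hypothetical thread counts
    mb = worst_case_buffered(threads) / 1e6
    print(f"{threads:>3} threads: {mb:.0f} MB ({heap_fraction(threads):.0%} of heap)")
```

With the limit set to -1 there is no such bound at all, since a single large or endless response can be buffered without limit.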
=============
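What should we be looking for in the heap dump? (I am a colleague of an_snatcher.) I downloaded the latest heapdump file to my local machine and ran the Eclipse Memory Analyzer against it. I do not know how to export data from the Memory Analyzer, so I will post screenshots of what it found, and hopefully you can make sense of them. It basically says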
"com.digitalpebble.stormcrawler.bolt.FetcherBolt$FetcherThread @ 0x8138adb0 FetcherThread #27 Shallow Size: 144 B Retained Size: 709.4 MB"
Here is what the Eclipse Memory Analyzer reports for the heap dump file:
Eclipse Memory Analyzer image 01
Eclipse Memory Analyzer image 02
Eclipse Memory Analyzer image 03
Eclipse Memory Analyzer image 04
Eclipse Memory Analyzer image 05
Eclipse Memory Analyzer image 06
2019-01-22T08:31:45.775-0500: 5.261: [GC (GCLocker Initiated GC) [PSYoungGen: 63488K->8826K(75776K)] 77975K->23321K(183808K), 0.0103824 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2019-01-22T08:31:46.619-0500: 6.106: [GC (Allocation Failure) [PSYoungGen: 72314K->12264K(90624K)] 86844K->30380K(198656K), 0.0228691 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
2019-01-22T08:31:47.414-0500: 6.901: [GC (Allocation Failure) [PSYoungGen: 90600K->15337K(93696K)] 108716K->33992K(201728K), 0.0215458 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
2019-01-22T08:31:47.499-0500: 6.986: [GC (Allocation Failure) [PSYoungGen: 93636K->14043K(110080K)] 112291K->32707K(218112K), 0.0191082 secs] [Times: user=0.03 sys=0.01, real=0.02 secs]
2019-01-22T08:31:47.565-0500: 7.052: [GC (Allocation Failure) [PSYoungGen: 106715K->13585K(111104K)] 125379K->32256K(219136K), 0.0110566 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2019-01-22T08:31:47.975-0500: 7.461: [GC (Allocation Failure) [PSYoungGen: 106257K->9626K(148480K)] 124928K->37589K(256512K), 0.0329521 secs] [Times: user=0.07 sys=0.02, real=0.03 secs]
2019-01-22T08:31:48.847-0500: 8.334: [GC (Metadata GC Threshold) [PSYoungGen: 120769K->5799K(149504K)] 148732K->123739K(344576K), 0.0346237 secs] [Times: user=0.07 sys=0.02, real=0.04 secs]
2019-01-22T08:31:48.882-0500: 8.369: [Full GC (Metadata GC Threshold) [PSYoungGen: 5799K->0K(149504K)] [ParOldGen: 117940K->115617K(263680K)] 123739K->115617K(413184K), [Metaspace: 57889K->57857K(1099776K)], 0.2179918 secs] [Times: user=0.66 sys=0.01, real=0.21 secs]
2019-01-22T08:31:56.805-0500: 16.291: [GC (Allocation Failure) [PSYoungGen: 131072K->4807K(189440K)] 246689K->120432K(453120K), 0.0092119 secs] [Times: user=0.03 sys=0.01, real=0.01 secs]
2019-01-22T08:32:11.898-0500: 31.385: [GC (Allocation Failure) [PSYoungGen: 181447K->1713K(195072K)] 297072K->120453K(458752K), 0.0062305 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2019-01-22T08:32:26.904-0500: 46.391: [GC (Allocation Failure) [PSYoungGen: 178353K->981K(234496K)] 297093K->120609K(498176K), 0.0048011 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2019-01-22T08:32:47.815-0500: 67.302: [GC (Allocation Failure) [PSYoungGen: 223701K->1518K(241664K)] 343329K->121154K(505344K), 0.0102639 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2019-01-22T08:33:07.716-0500: 87.203: [GC (Allocation Failure) [PSYoungGen: 194483K->1385K(262144K)] 314119K->121029K(525824K), 0.0059916 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2019-01-22T08:33:11.599-0500: 91.086: [GC (Allocation Failure) [PSYoungGen: 127845K->1390K(268288K)] 247489K->140704K(1666560K), 0.0107712 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2019-01-22T08:33:11.610-0500: 91.097: [GC (Allocation Failure) [PSYoungGen: 1390K->1401K(294400K)] 140704K->140715K(1692672K), 0.0037587 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
2019-01-22T08:33:11.614-0500: 91.100: [Full GC (Allocation Failure) [PSYoungGen: 1401K->0K(294400K)] [ParOldGen: 139314K->51057K(201728K)] 140715K->51057K(496128K), [Metaspace: 60831K->60827K(1101824K)], 0.0966803 secs] [Times: user=0.24 sys=0.01, real=0.09 secs]
2019-01-22T08:33:11.712-0500: 91.199: [GC (Allocation Failure) [PSYoungGen: 0K->0K(293888K)] 51057K->51057K(1692160K), 0.0100144 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2019-01-22T08:33:11.723-0500: 91.209: [Full GC (Allocation Failure) [PSYoungGen: 0K->0K(293888K)] [ParOldGen: 51057K->48333K(224768K)] 51057K->48333K(518656K), [Metaspace: 60827K->60134K(1101824K)], 0.2302426 secs] [Times: user=0.67 sys=0.01, real=0.23 secs]
Heap
PSYoungGen total 293888K, used 1071K [0x00000000d5580000, 0x00000000ee180000, 0x0000000100000000)
eden space 275968K, 0% used [0x00000000d5580000,0x00000000d568bfb8,0x00000000e6300000)
from space 17920K, 0% used [0x00000000e6300000,0x00000000e6300000,0x00000000e7480000)
to space 17408K, 0% used [0x00000000ed080000,0x00000000ed080000,0x00000000ee180000)
ParOldGen total 1398272K, used 48333K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
object space 1398272K, 3% used [0x0000000080000000,0x0000000082f335b0,0x00000000d5580000)
Metaspace used 60138K, capacity 60994K, committed 62464K, reserved 1101824K
class space used 9379K, capacity 9681K, committed 9984K, reserved 1048576K
worker.log.err:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
Heap dump file created [965011634 bytes in 9.400 secs]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
Unable to create artifacts/heapdump: File exists
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
Unable to create artifacts/heapdump: File exists
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
.
robots.txt:
User-agent: *
Crawl-delay: 10
# Directories
Have you tried analyzing the heap dump with JHat or VisualVM?
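Update: the heap dump above suggests that the memory is filled with content held by the fetcher threads. The fact that you no longer get the error after reducing the content limit confirms this. Use more memory if you can, or keep limiting the maximum content length. You could also run with fewer fetcher threads in parallel.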
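Note: if you hit an endless stream, e.g. a radio or video feed, the default http implementation will keep loading content regardless of the limit you set. The okhttp implementation is more reliable in that respect.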
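To illustrate why a content limit only protects the heap if the client actually stops reading once the limit is reached, here is a minimal Python sketch. It is not StormCrawler's or okhttp's code; `read_capped` and `EndlessStream` are hypothetical names made up for this example, and the 5,000,000-byte figure is simply the limit mentioned in this thread.

```python
import io


class EndlessStream:
    """Hypothetical stand-in for a radio/video feed that never ends."""

    def read(self, n):
        # Always has n more bytes to give.
        return b"\x00" * n


def read_capped(stream, limit, chunk_size=8192):
    """Read at most `limit` bytes (limit >= 0) from `stream`.

    Returns (data, truncated). The loop stops pulling data as soon as
    the cap is reached, so memory use is bounded by `limit` even when
    the source never signals end-of-stream.
    """
    buf = bytearray()
    while len(buf) < limit:
        chunk = stream.read(min(chunk_size, limit - len(buf)))
        if not chunk:
            # Source ended before the cap: nothing was cut off.
            return bytes(buf), False
        buf.extend(chunk)
    # Cap reached: peek one byte to see whether more data was available.
    truncated = bool(stream.read(1))
    return bytes(buf), truncated


if __name__ == "__main__":
    data, truncated = read_capped(EndlessStream(), 5_000_000)
    print(len(data), truncated)
```

A reader that buffers the whole body first and trims it afterwards would never return on `EndlessStream`, which is the failure mode the note above warns about.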
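Update: maybe it is http.content.limit? We had set it to -1 because our extractor was not retrieving the whole page (one of our sites has a very large menu at the top of its pages). Turning the limit off entirely seems to have been a mistake. We have now set http.content.limit: 5000000 (5 MB) and let it run. No errors so far...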
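For reference, the settings discussed in this thread could be combined in the crawler configuration roughly as follows. This is a sketch, not a tested configuration: the key names are the ones I recall from the StormCrawler 1.x default configuration, so check them against your version. The thread count and the heap size are illustrative assumptions. Only the 5000000 content limit comes from the thread itself, and the GC log above shows the worker currently running with a 2 GB maximum heap.

```yaml
config:
  # Maximum number of bytes kept per fetched document; -1 disables the cap.
  http.content.limit: 5000000

  # Use the okhttp-based protocol implementation, which the answer above
  # describes as more reliable at honouring the limit on endless streams.
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"

  # Fewer fetcher threads means fewer documents held in memory at once.
  # 10 is an illustrative value, not a recommendation.
  fetcher.threads.number: 10

  # Worker JVM options. -Xmx4g is an assumed value for a machine with
  # about 8 GB of physical memory; adjust it to what the host can spare.
  topology.worker.childopts: "-Xmx4g -Djava.net.preferIPv4Stack=true"
```

As a rough worst case, the fetcher threads can hold about the number of threads times the content limit in document bodies at once, which is why lowering either value reduces heap pressure.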
=============
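What should we look for in the heap dump? (I am a colleague of an_snatcher.) I downloaded the latest heapdump file to my local machine and ran Eclipse Memory Analyzer against it. I don't know how to export data from Memory Analyzer, so I will post screenshots of what it found and hope you can make sense of them. It basically says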
"com.digitalpebble.stormcrawler.bolt.FetcherBolt$FetcherThread @ 0x8138adb0 FetcherThread #27 Shallow Size: 144 B Retained Size: 709.4 MB"
Here is what Eclipse Memory Analyzer reports for the heap dump file:
Eclipse Memory Analyzer image 01
Eclipse Memory Analyzer image 02
Eclipse Memory Analyzer image 03
Eclipse Memory Analyzer image 04
Eclipse Memory Analyzer image 05
Eclipse Memory Analyzer image 06