Disk full made MQ dead

Question

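We have an application that uses WebSphere MQ 7.0.1.3. During heavy testing in our stage environment, the disk filled up.

After that, MQ hung. We deleted application logs (unrelated to MQ) and added more disk, but that did not solve the problem.

We tried restarting the queue manager: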

$ endmqlsr
$ endmqm XYZ
$ strmqm XYZ
WebSphere MQ queue manager 'XYZ' starting.
WebSphere MQ was unable to display an error message 893.

Logs from when the disk was full and the errors occurred:

----- amqxfdcx.c : 828 --------------------------------------------------------
06/08/2018 03:36:44 AM - Process(8832.5) User(mqm) Program(amqzlaa0)
AMQ6119: An internal WebSphere MQ error has occurred (Rc=28 from write)
----- amqxfdcx.c : 783 --------------------------------------------------------
06/08/2018 03:36:44 AM - Process(8832.5) User(mqm) Program(amqzlaa0)
AMQ6184: An internal WebSphere MQ error has occurred on queue manager XYZ.
----- amqxfdcx.c : 822 --------------------------------------------------------
06/08/2018 03:36:46 AM - Process(8832.5) User(mqm) Program(amqzlaa0)
AMQ6119: An internal WebSphere MQ error has occurred (Rc=28 from write)
----- amqxfdcx.c : 783 --------------------------------------------------------
06/08/2018 03:36:46 AM - Process(8832.5) User(mqm) Program(amqzlaa0)
AMQ6184: An internal WebSphere MQ error has occurred on queue manager XYZ.
AMQ6119: An internal WebSphere MQ error has occurred ('28 - No space left on device' from semget.)
----- amqxfdcx.c : 783 --------------------------------------------------------
06/14/2018 02:35:46 PM - Process(6794.1) User(mqm) Program(amqzxma0)
AMQ6184: An internal WebSphere MQ error has occurred on queue manager XYZ.
----- amqxfdcx.c : 822 --------------------------------------------------------
06/14/2018 02:35:46 PM - Process(6794.1) User(mqm) Program(amqzxma0)
AMQ6118: An internal WebSphere MQ error has occurred (20006037)

When trying to connect with IBM WebSphere MQ Explorer:

Queue manager not available for connection - reason 2059. (AMQ4043)
Severity: 20 (Error)
Explanation: The attempt to connect to the queue manager failed. This could be because the queue manager is incorrectly configured to allow a connection from this system, or the connection has been broken.
Response: Ensure that the queue manager is running. If the queue manager is running on another computer, ensure it is configured to accept remote connections.

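Is there a way to clear all messages from the queues and reset all flags so that the queue manager starts and the queues work again?

The queues contain only old test data of no value.

Or do you have any other suggestions on how to solve this?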

Answer 1

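You can use the mqrc command to get more information about the error. In most cases MQ reports return codes as four-digit decimal numbers. Here the return code has only three digits, which usually (always?) means it is a hexadecimal return code. Hex 893 is 2195 in decimal, which mqrc can look up: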

$ mqrc 2195

      2195  0x00000893  MQRC_UNEXPECTED_ERROR
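The hex-to-decimal step can be checked in any shell, even on a machine without MQ installed. This is a minimal sketch; `hex_to_dec` is a hypothetical helper name, not an MQ command.

```shell
# Convert a hex return code (as printed by strmqm, e.g. 893) to the
# decimal reason code form that can be passed to mqrc.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

hex_to_dec 893    # prints 2195
```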

MQ raises this error when it hits an unexpected error condition. You will usually find that an FDC file with more details has been created in the /var/mqm/errors directory.
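To find the FDC files written around the time of the failure, something like the following may help. It is only a sketch: `recent_fdcs` is a hypothetical helper, and the default path assumes the standard /var/mqm/errors location mentioned above.

```shell
# List up to five of the most recently modified FDC files in an MQ
# errors directory. Pass a different directory as $1 if your data
# path is not the default.
recent_fdcs() {
    dir="${1:-/var/mqm/errors}"
    # ls -t sorts newest first; stderr is silenced when no FDC files exist.
    ls -t "$dir"/*.FDC 2>/dev/null | head -n 5
}
```

Running `recent_fdcs` and then paging through the newest file with `less` is usually enough to see the header that describes the failing component.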

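When you get an error like this, the best course of action is to open a PMR with IBM and have them provide recovery guidance, so that you have the best chance of keeping any messages that may be on the queues. However, the version of MQ you are using (7.0) went out of support on September 30, 2015. The specific fix pack you are on (7.0.1.3) was released in August 2010. IBM's latest release of v7.0 is 7.0.1.14, from August 2016.

If you pay IBM for extended support, you can open a PMR with them to get further help.

The best path once the problem is solved is to migrate to a supported version of IBM MQ. At the time of writing, v8.0 and v9.0 are the only supported versions of IBM MQ.

Assuming you do not have extended support and cannot get help from IBM, here are some suggested steps:

Updating to the latest Fix Pack (7.0.1.14) may help. Even if it does not solve the problem, it is best to be on the latest Fix Pack of an unsupported version of IBM MQ.
You can try to cold start your queue manager and see if that helps. This is documented starting on page 4 of the presentation "WebSphere MQ Disaster Recovery" given by Mark Taylor at Capitalware's MQ Technical Conference v2.0.1.3.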

Create a queue manager EXACTLY like the one that failed
Use qm.ini to work out parameters to crtmqm command
Log:
  LogPrimaryFiles=10
  LogSecondaryFiles=10
  LogFilePages=65535
  LogType=CIRCULAR
Issue the crtmqm command

crtmqm -lc -lf 65535 -lp 10 -ls 10 -ld /tmp/mqlogs TEMP.QMGR

Make sure there is enough space for the new log files in that directory

Name of the dummy queue manager is irrelevant

Only care about getting the log files

Don’t start this dummy queue manager, just create it
Replace old logs and amqhlctl.lfh with the new ones
cd /var/mqm/log
mv QM1 QM1.SAVE
mv /tmp/mqlogs/TEMP!QMGR QM1
Note the “mangled” directory name … this is normal
Data in the queues is preserved if messages are persistent

Object definitions are also preserved

Objects contain their own definitions in their files

Mapping between files and object names held in QMQMOBJCAT
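The "use qm.ini to work out parameters" and "make sure there is enough space" steps can be sketched in shell. This is an illustration under assumptions, not part of the presentation: `crtmqm_from_ini` and `log_bytes` are hypothetical helper names, the parser assumes simple `Key=Value` lines in the Log stanza, and the size estimate relies on MQ log file pages being 4 KB each.

```shell
# Print a crtmqm command line built from the Log stanza of a qm.ini file.
# $1 = path to qm.ini, $2 = temporary log directory, $3 = dummy queue manager name
crtmqm_from_ini() {
    awk -F= -v dir="$2" -v qm="$3" '
        { gsub(/[ \t\r]/, "") }   # strip whitespace; awk re-splits the fields
        $1 == "LogPrimaryFiles"   { prim = $2 }
        $1 == "LogSecondaryFiles" { sec = $2 }
        $1 == "LogFilePages"      { pages = $2 }
        $1 == "LogType"           { type = (toupper($2) == "LINEAR") ? "-ll" : "-lc" }
        END {
            if (type == "") type = "-lc"   # assumption: fall back to circular logging
            printf "crtmqm %s -lf %s -lp %s -ls %s -ld %s %s\n", type, pages, prim, sec, dir, qm
        }
    ' "$1"
}

# Estimate log space in bytes: number of files x pages per file x 4096.
log_bytes() {
    echo $(( $1 * $2 * 4096 ))
}

log_bytes 10 65535    # primary files only: prints 2684313600
log_bytes 20 65535    # primary + secondary: prints 5368627200
```

For example, `crtmqm_from_ini /var/mqm/qmgrs/QM1/qm.ini /tmp/mqlogs TEMP.QMGR` only prints the command so it can be reviewed before it is run. With the values shown above, the temporary directory needs roughly 2.5 GiB for the primary log files.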

After you have done all of the above, try to start your queue manager.
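The slides use QM1 as the queue manager name, so the directory names need adapting. The sketch below only prints the commands for a given queue manager so they can be reviewed first. `mangle` and `cold_start_plan` are hypothetical helpers. They assume the default /var/mqm/log location, a queue manager that is already stopped, and that a dot in the name is the only character needing mangling (as in the `TEMP!QMGR` example above).

```shell
# Map a queue manager name to its directory name ("." becomes "!").
mangle() {
    printf '%s\n' "$1" | tr '.' '!'
}

# Print (do not run) the log-swap commands.
# $1 = failed queue manager, $2 = dummy queue manager, $3 = temporary log dir
cold_start_plan() {
    real=$(mangle "$1")
    dummy=$(mangle "$2")
    echo "cd /var/mqm/log"
    echo "mv '$real' '$real.SAVE'"
    # Paths are single-quoted so "!" is safe to paste into an interactive shell.
    echo "mv '$3/$dummy' '$real'"
    echo "strmqm $1"
}

cold_start_plan XYZ TEMP.QMGR /tmp/mqlogs
```

Keeping the old logs in the `.SAVE` directory means the swap can be undone if the cold start does not help.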


ibm-mq