IRIS mirror went into a hung state

Question

Phillip Wu · Jan 19, 2023

I have the following servers in an IRIS mirror set:
Arbiter; isc_agent only
LIVETC01; IRIS DB full install; Primary
LIVETC02; IRIS DB full install; Backup

A couple of days ago IRIS hung.
The application using LIVETC01 DB stopped functioning.

I'm trying to find out the sequence of events leading up to the failure.

I see these entries in the log:
Arbiter:
2023-01-17T15:54:56 ISCAgent: Arbiter client error: Message read failed.
2023-01-17T15:54:56 ISCAgent: Completed serving application: ISC1ARBITER
2023-01-17T15:54:56 ISCAgent: Arbiter client error: Message read failed.
2023-01-17T15:54:56 ISCAgent: Completed serving application: ISC1ARBITER

LIVETC01:
01/17/23-15:56:22 Arbiter connection lost
01/17/23-15:56:23 MirrorServer: Received new failover mode (Agent Controlled) from backup...(repeated 1 times)
01/17/23-15:56:41 ECP: connection from 'LIVETC01' dropped
01/17/23-15:56:42 MirrorServer: Mirror entered trouble state
01/17/23-15:56:42 MirrorServer: Mirror id #0 set trouble state

LIVETC02:
01/17/23-15:56:23 MirrorClient: Switched from Arbiter Controlled to Agent Controlled failover on request from primary
01/17/23-15:56:41 ECP: Mirror Connection request from 'TRAK-LIVETC01'
01/17/23-15:56:41 ECPs: Conn redirected to primary mirror server @ (LIVETC01)
01/17/23-15:56:44 MirrorClient: The backup node has become inactive from a status query
01/17/23-15:56:51 MirrorClient: The backup node has become active
01/17/23-15:56:51 MirrorClient: The backup node has become inactive from a ping
01/17/23-15:56:53 MirrorClient: The backup node has become active
01/17/23-15:57:12 Mirror Connection request from 'LIVETC01'
01/17/23-15:57:12 ECPs: Conn redirected to primary mirror server @ (LIVETC01)
01/17/23-15:57:13 Mirror Connection request from 'TRAK-LIVETC01'
01/17/23-15:57:13 Conn redirected to primary mirror server @ (LIVETC01)
01/17/23-15:57:20 MirrorClient: Primary AckDaemon failed to answer status request.
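
To see the sequence of events across all three servers, the entries can be merged into one timeline. The sketch below is only an illustration: it uses a handful of lines copied from the logs above, the names `parse_line` and `merge_logs` are made up for this example, and it assumes the three server clocks are synchronized, which the logs themselves do not confirm.

```python
from datetime import datetime

# A few entries copied from the logs above; the variable names are illustrative.
ARBITER_LOG = """\
2023-01-17T15:54:56 ISCAgent: Arbiter client error: Message read failed.
2023-01-17T15:54:56 ISCAgent: Completed serving application: ISC1ARBITER
"""
PRIMARY_LOG = """\
01/17/23-15:56:22 Arbiter connection lost
01/17/23-15:56:42 MirrorServer: Mirror entered trouble state
"""
BACKUP_LOG = """\
01/17/23-15:56:23 MirrorClient: Switched from Arbiter Controlled to Agent Controlled failover on request from primary
01/17/23-15:57:20 MirrorClient: Primary AckDaemon failed to answer status request.
"""

# The arbiter log and the two IRIS logs use different timestamp layouts.
FORMATS = ("%Y-%m-%dT%H:%M:%S", "%m/%d/%y-%H:%M:%S")


def parse_line(line):
    """Return (timestamp, message), or None if the line has no known timestamp."""
    stamp, _, message = line.partition(" ")
    for fmt in FORMATS:
        try:
            return datetime.strptime(stamp, fmt), message
        except ValueError:
            continue
    return None


def merge_logs(logs):
    """Merge {host: log_text} into one list of (timestamp, host, message), oldest first."""
    events = []
    for host, text in logs.items():
        for line in text.splitlines():
            parsed = parse_line(line.strip())
            if parsed:
                events.append((parsed[0], host, parsed[1]))
    # sort() is stable, so entries with equal timestamps keep their input order.
    events.sort(key=lambda event: event[0])
    return events


timeline = merge_logs(
    {"Arbiter": ARBITER_LOG, "LIVETC01": PRIMARY_LOG, "LIVETC02": BACKUP_LOG}
)
for when, host, message in timeline:
    print(f"{when:%H:%M:%S}  {host:<9} {message}")
```

With the full log files pasted in, the same function gives a single ordered view, which makes it easier to see which server reported a problem first.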

Questions
=========
1. It looks like the Arbiter failed. Would you agree?
2. Can I find out why the Arbiter logged "Message read failed"?
Is "Message read failed" an I/O error or a timeout?
3. It looks like the primary (LIVETC01) detected that the Arbiter had failed.
The primary then told the backup that the Arbiter had failed.
Would you agree?
4. Does anyone know why the backup node goes inactive and then active again a few seconds later (it happens twice, starting at 15:56:44 and 15:56:51)?
5. Does anyone know why the "Primary AckDaemon failed to answer status request"?
6. I don't think the backup took over as primary at any point.
Would you agree?
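
For question 4, the length of each inactive period can be worked out from the LIVETC02 timestamps. This is a minimal sketch using only the four state-change entries from the log above; `inactive_gaps` is a name made up for the example.

```python
from datetime import datetime

# Backup-node state changes copied from the LIVETC02 log above.
STATE_CHANGES = [
    ("01/17/23-15:56:44", "inactive"),
    ("01/17/23-15:56:51", "active"),
    ("01/17/23-15:56:51", "inactive"),
    ("01/17/23-15:56:53", "active"),
]


def inactive_gaps(changes):
    """Return the length in seconds of each inactive -> active period."""
    gaps = []
    went_inactive = None
    for stamp, state in changes:
        when = datetime.strptime(stamp, "%m/%d/%y-%H:%M:%S")
        if state == "inactive":
            went_inactive = when
        elif state == "active" and went_inactive is not None:
            gaps.append(int((when - went_inactive).total_seconds()))
            went_inactive = None
    return gaps


gaps = inactive_gaps(STATE_CHANGES)
print(gaps)  # [7, 2]
```

So the backup was marked inactive for 7 seconds the first time and 2 seconds the second time, at the one-second resolution of the log.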

Thanks for any help

Product version: IRIS 2022.2

Discussion (2)

score 0 · Answer 1 · 2023-01-19T04:46:46-05:00

Yes, the arbiter was unable to communicate. It looks like a network issue.
I recommend opening a WRC case for that.

score 0 · Answer 2 · 2023-01-20T18:57:21-05:00

Phillip Wu · Jan 20, 2023

Thanks for the advice
