Ephraim,

When you say "corrupted", a few questions to better understand the situation...
- Did you try to mount the DB (from the SMP or with ^MOUNT)? Sometimes, if IRIS/Cache was "forced down", a *.lck file in the DB folder needs to be deleted in order to allow a successful mount.
- If the DB is mounted, did you get a <DATABASE> (or other) error? If so, then ^Integrity and ^REPAIR could help - but only if you fully understand how to use those tools (!) Most of the time, a corrupted DB is fixable using those tools, or at least 99% of the data can be recovered. It also depends on the number of errors: if it is huge, then sometimes it is faster to recover the DB from a valid backup + journal files. See the sketch after this list.
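
All of these are run from a terminal in the %SYS namespace; a minimal sketch (treat ^REPAIR with extreme care, ideally with WRC guidance):

%SYS>Do ^MOUNT      ; interactive utility to mount a database
%SYS>Do ^Integrity  ; interactive integrity check, reports any corrupted blocks
%SYS>Do ^REPAIR     ; low-level block repair, only if you know exactly what you are doing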

BTW - if this is a mirrored DB, then there are other considerations as well.

Happy new year!

On a given machine, any process can run only "as fast" as the CPU clock rate allows (higher clock = faster = more operations/sec.)

It is true that a single process can do approx. 15-20 MB/sec. (depending on the CPU clock rate & the disk type: SSD, Premium SSD, NVMe)

The best way to overcome this limitation is to do the "heavy I/O" processing in parallel using the queue manager, as sketched below.
On machines with 16/32 cores, you may reach your "infrastructure limits" (160MB/sec) easily and even go beyond (we managed to reach 1000MB/sec on NVMe disks)
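
A minimal sketch of that pattern with the Work Queue Manager ($SYSTEM.WorkMgr); MyApp.Loader and its ProcessChunk ClassMethod are hypothetical placeholders for your own code that handles one slice of the data:

Set queue=$SYSTEM.WorkMgr.%New()    ; worker pool, sized by default from the core count
For i=1:1:16 {
    ; queue one unit of work per chunk; each runs in its own worker process
    Set sc=queue.Queue("##class(MyApp.Loader).ProcessChunk",i)
    If $System.Status.IsError(sc) Quit
}
Set sc=queue.WaitForComplete()      ; block until all queued work has finished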

The "write daemon" is a process that is responsible to write all new/changed data to the disk where the WIJ (write image journal) file is located. Only then, actual databases are being updated.

Sometimes, when the system is very busy with a lot of pending writes, this error appears and then clears automatically after a few minutes (e.g. when you rebuild a huge index, or run some data migration process).

I would monitor the disk activity of the disk the WIJ file is located on (by default it is on the same disk where you installed Cache).
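
One quick way to watch it from a terminal (a sketch; both are standard utilities in the %SYS namespace, check your version for the exact parameters):

ZN "%SYS"
Do ^GLOSTAT        ; interactive: global statistics, physical reads/writes per interval
Do ^mgstat(1,10)   ; 10 one-second samples of system metrics, including write daemon activity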

One solution is to move the WIJ to a different, less busy disk. This will give the "write daemon" more writing capacity (you will have to restart Cache).
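
On IRIS, the WIJ location is the wijdir setting in the [config] section of the CPF file (a sketch, assuming a dedicated fast disk mounted at /fastdisk; on Cache the equivalent setting lives in cache.cpf):

[config]
wijdir=/fastdisk/iris/wij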

As you probably know, InterSystems IRIS is capable of doing calculations on numbers of up to 19 digits. This is due to the fact that they are stored as signed 64-bit integers.

If you are using one of the latest versions of IRIS, which include Embedded Python, then you have the Python support for arbitrary-precision integers ("bignum" / "long").
So you may have a ClassMethod like this:

ClassMethod BigNumbers1() [ Language = python ]
{
a = 9223372036854775807            # largest signed 64-bit integer
b = 9223372036854775808            # one beyond the 64-bit range
c = 9223372036854775807123456789   # far beyond 64 bits
print(a)
print(b)
print(c)
print(c % 2)
}
 

Which will give you the output:

9223372036854775807
9223372036854775808
9223372036854775807123456789
1
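
You can call it from ObjectScript like any other ClassMethod (MyPkg is a placeholder for whatever package holds the class):

Do ##class(MyPkg.MyClass).BigNumbers1()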

SAM executes a REST API call to http://[your-server]/api/monitor/metrics for every server included in your cluster. I'm not sure where the interval is configured.

If your own metric only needs to run once a day, you can schedule it with the Task Manager, have the result stored in a global, and let the "user defined" sensor read that global, which will not cause any performance issue.

BTW - one thing I forgot to mention in the previous post is:
In order to have SAM run your "user defined" metrics, you need to add your class to SAM:

%SYS>Set sc = ##class(SYS.Monitor.SAM.Config).AddApplicationClass("MyClass", "namespace")

First create a class:

Class MyClass Extends %SYS.Monitor.SAM.Abstract

Add a parameter that will indicate the prefix name for all your user defined metrics.

Parameter PRODUCT = "Prefix";

Create a wrapper method GetSensors() that sets all your user defined sensors (which can be ClassMethods):

Method GetSensors() As %Status
{
Try {
   D ..SetSensor("sensor1", ..Sensor1())
   D ..SetSensor("sensor2", ..Sensor2())
   }
Catch e { ;call your error-logging function, e.g. ##class(anyClass).StoreError($classname(),e.DisplayString()) }
Quit $$$OK
}
ClassMethod Sensor1() As %Integer
{
   ; do any calculation
   Quit Value
}
ClassMethod Sensor2() As %Integer
{
   ; do any calculation
   Quit Value
}
}
 

Now the REST API call to /api/monitor/metrics will return your "user defined" sensors under the names:
Prefix_sensor1 and Prefix_sensor2

Remarks:
- Make sure that your GetSensors() and all your "user defined" sensors (ClassMethods) have proper error handling so they are fail safe (you may use a try/catch or any other error trap like $ZT="something")
- Make sure all your "user defined" sensors perform fast. This enables the SAM metrics REST API call to get the data quickly, without delays. In case some calculations are "heavy", it is better to have a separate process (scheduled by the Task Manager) do those calculations and store the results in a global for fast retrieval by the sensor, as sketched below.
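
A minimal sketch of that pattern, where ^MyMetrics is a hypothetical global filled once a day by a scheduled task:

ClassMethod Sensor1() As %Integer
{
   ; just read the precomputed value; the heavy calculation runs elsewhere
   Quit +$Get(^MyMetrics("sensor1"))
}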

When you install SAM, it is usually installed as a container bundle (we use Docker, so I don't have experience with podman).

We have this on a separate machine (Linux) while our IRIS servers are Windows, but I don't see any limitation (except memory & CPU = performance) to running the SAM container on the same IRIS machine.

Grafana and Prometheus are part of the "bundle" (container set) for SAM, so you do not need to install them separately.

Get the mirror name, then iterate over the mirror member status list:

ZN "%sys"
Set mirrorName=$system.Mirror.GetMirrorNames()
Set result = ##class(%ResultSet).%New("SYS.Mirror:MemberStatusList")
Set sc = result.Execute(mirrorName)
while result.Next() {
   Set transfer=result.GetData(6)
   // you may filer the check for a specific machine on GetData(1)
   // Do any check on "transfer" to see if behind and calculate the threshold time e.g.
  // For i=1:1:$l(transfer," ") {  
  //    If $f($p(transfer," ",i),"hour") { W !,"hour(s) behind" }
  //      Elseif $f($p(transfer," ",i),"minute") { Set minutes=$p(transfer," ",i-1) W !,minutes_" minutes behind" }
  //   }
}

To get any component status, you may use:

SELECT Name, Enabled FROM Ens_Config.Item where Name['Yourname'

To check queues you may use the following SQL:

select Name,PoolSize from ENS_Config.Item where Production='YourProductionName'

Then, iterate over the result set and get each queue depth with:

Set QueueCount=##class(Ens.Queue).GetCount(Name)
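
Putting the two together (a sketch; 'YourProductionName' is a placeholder):

Set rs=##class(%SQL.Statement).%ExecDirect(,"select Name from Ens_Config.Item where Production = ?","YourProductionName")
While rs.%Next() {
   Write !,rs.Name,": ",##class(Ens.Queue).GetCount(rs.Name)
}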

To check the latest activity on a component, I would query:

SELECT * FROM Ens.MessageHeader where TargetQueueName='yourComponentName'

and then check the TimeProcessed column.
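
Or get the most recent activity directly (same idea, as one query):

SELECT TOP 1 TimeProcessed FROM Ens.MessageHeader
WHERE TargetQueueName='yourComponentName'
ORDER BY TimeProcessed DESC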

You have to distinguish between "journal" and "mirror-journal" files. The first ensure each instance's DB integrity (recovery from corruption) in case of a failure. The second ensure proper mirror failover and feed any async members.

When LIVETC01 (as a backup) is "caught up", it is a good source for copying .DAT files to the LIVEDR.
It is also safe to delete its mirror-journals.
The steps you did to catch up the LIVEDR are correct. (I assume you did "activate" & "catch up" after that on LIVEDR)

After the IRIS.DAT copy (of all DBs in the mirror) from LIVETC01 to LIVEDR, and once both are "caught up" - it is safe to delete mirror-journals up to the point of the copy from your primary LIVETC02.

Hello,


The best way to do it is to use the dictionary to loop over the properties of the original class and create a new class which is identical, but with a different storage. The cloning is done using %ConstructClone.
Usually the new class, being for backup, does not need to have methods, indices or triggers, so those can be "cleaned" before saving it.

Have the original and the destination class objects:

S OrigCls=##class(%Dictionary.ClassDefinition).%OpenId(Class)
S DestCls=OrigCls.%ConstructClone(1)

You should give the destination class a name and type:

S DestCls.Name="BCK."_Class , DestCls.Super="%Persistent"

Usually the destination class does not need to have anything other than the properties, so in case there are methods, triggers or indices that need to be removed from the destination class, you may do:

F i=DestCls.Methods.Count():-1:1 D DestCls.Methods.RemoveAt(i)      ; clear methods/classmethods
F i=DestCls.Triggers.Count():-1:1 D DestCls.Triggers.RemoveAt(i)    ; clear triggers
F i=DestCls.Indices.Count():-1:1 D DestCls.Indices.RemoveAt(i)      ; clear indices (iterate backwards: RemoveAt shifts the following items)

Setting the new class storage:

S StoreGlo=$E(OrigCls.Storages.GetAt(1).DataLocation,2,*)              ; original data global, without the leading "^"
S StoreBCK="^BCK."_$S($L(StoreGlo)>27:$P(StoreGlo,".",2,*),1:StoreGlo) ; keep the global name within the 31-character limit

S DestCls.Storages.GetAt(1).DataLocation=StoreBCK
S DestCls.Storages.GetAt(1).IdLocation=StoreBCK
S DestCls.Storages.GetAt(1).IndexLocation=$E(StoreBCK,1,*-1)_"I"
S DestCls.Storages.GetAt(1).StreamLocation=$E(StoreBCK,1,*-1)_"S"
S DestCls.Storages.GetAt(1).DefaultData=$P(Class,".",*)_"DefaultData"

Then just save the DestCls

S sc=DestCls.%Save()
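
and compile it, so the new storage becomes active (a sketch):

S sc=$SYSTEM.OBJ.Compile("BCK."_Class,"ck")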

Usually the GREF is the total number of global references (per second). A given process can do a limited number of I/O operations/sec. (this is bounded by the CPU clock speed).
When there are bottlenecks, there are some tools that can tell you which part of your system (or code) can be improved. Monitoring with SAM or other tools can give you some numbers to work with. There is also the ^%SYS.MONLBL line-by-line monitor that can help you improve your code.
Storage is also a consideration: sometimes a DB can be optimized to store data in a more compact way and save I/O (especially in the cloud, where disks are sometimes slower than what you have on premises).
One easy improvement is to run the "heavy" parts of your system (e.g. reports, massive data manipulations etc.) in parallel. This can be done by using the "queue manager" or with the %PARALLEL keyword for SQL queries.
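For example, %PARALLEL in the FROM clause lets the SQL optimizer split the work across several processes (a sketch; the table and columns are placeholders):

SELECT Name, Amount FROM %PARALLEL MyApp.Orders WHERE Amount > 1000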
A more complex way to go is to scale the system vertically or horizontally, or even to use sharding.