Murray Oldfield · Dec 20, 2020 go to post

I think Vic's links are a good place to start. Every application is such a snowflake that it is very hard to make blanket recommendations. The community link will give you a good idea of how system resources are used and what to monitor, so you know whether your app is near its limits or has resources to spare that could be right-sized. The documentation links are good for Caché / IRIS general guidelines as well.

Murray Oldfield · May 17, 2020 go to post

Hi, yes you can import using the System Management Portal; System > Classes, then import into %SYS.

Here is version information before;

%SYS>write $$version^SystemPerformance()
14

After import, you can see the version information changed. Also note there was a conversion run. The custom profile I had created before the import existed after the update. 

%SYS>write $$version^SystemPerformance()
$Id: //iris/2020.1.0/databases/sys/rtn/diagnostic/systemperformance.mac#1 $
%SYS>d ^SystemPerformance
Re-creating command data for new ^SystemPerformance version.
Old command data saved in ^IRIS.SystemPerformance("oldcmds").
Current log directory: /path/path/iris/mgr/
Available profiles:
     1  12hours     - 12 hour run sampling every 10 seconds
     2  24hours     - 24 hour run sampling every 10 seconds
     3  30mins      - 30 minute run sampling every 1 second
     4  4hours      - 4 hour run sampling every 5 seconds
     5  5_mins_1_sec- 5 mins 1 sec
     6  8hours      - 8 hour run sampling every 10 seconds
     7  test        - A 5 minute TEST run sampling every 30 seconds

Select profile number to run: 5

Collection of this sample data will be available in 420 seconds.
The runid for this data is 20200518_094753_5_mins_1_sec.

%SYS>

You can also import from the command line;

USER>zn "%SYS"

%SYS>do $system.OBJ.Load("/path/SystemPerformance-IRIS-All-2020.1.0-All.xml","ck")

Load started on 05/18/2020 10:02:13
Loading file /path/SystemPerformance-IRIS-All-2020.1.0-All.xml as xml
Imported object code: SystemPerformance
Load finished successfully.

%SYS>
Murray Oldfield · Apr 15, 2020 go to post

Hi, I have not tested on 2020.1. Are you saying there is no change in any of the metrics after 40 seconds on a busy system?

Murray Oldfield · Dec 12, 2019 go to post

If you mean are there metrics for Ensemble, HealthShare, etc., then no, not at the moment. However, this is on the roadmap.

You can add custom metrics though; see the section "Create Application Metrics" in the IRIS Documentation.

This will be very powerful when you start to combine telemetry from all the services that make up an application; from the OS, IRIS, and the application.

Murray Oldfield · Dec 12, 2019 go to post

Hi Ron, I should have been clearer. The metrics are in a format to be consumed by Prometheus (or SAM). Once in Prometheus they go into a database that Grafana connects to as a Prometheus datasource. You want to do it this way to get the full functionality of Prometheus queries + Grafana visualisation. We did try using a connector directly to IRIS (SimpleJSON), but that really limits the functionality. I will be publishing some example Grafana templates specific to IRIS soon. In the meantime, Mikhail's post (link here) has an example of connecting to Grafana near the end.

Murray Oldfield · Nov 17, 2019 go to post

Hi Ashley, I don't do much with Windows, but a colleague offered the following as 'quick and dirty' examples. Perhaps, as this gets bumped to the front page of the community because of the answer, someone else can contribute a more functional example.

For your production use you will need to substitute your paths etc and add logging and perhaps enhance the error checking. So with all the usual caveats of test before use in production and so on;

Freeze Script

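D:
CD D:\InterSystems\T2017\mgr
..\bin\cache -s. -B -V -U%%SYS ##Class(Backup.General).ExternalFreeze() 

rem ExternalFreeze exits with 5 on success and 3 on failure.
rem "if errorlevel N" is true for N or higher, so test the highest value first.
if errorlevel 5 goto NEXT
if errorlevel 3 goto FAIL
goto END

:NEXT
rem First instance is frozen; now freeze the second instance.
rem NOTE: if this second freeze fails, the first instance is still frozen and needs a thaw.
CD D:\InterSystems\HS20152\mgr
..\bin\cache -s. -B -V -U%%SYS ##Class(Backup.General).ExternalFreeze() 

if errorlevel 5 goto OK
if errorlevel 3 goto FAIL
goto END

:OK
Echo SYSTEM IS FROZEN
exit 0

:FAIL
echo ERROR
exit 1

:END
rem Unexpected exit status; treat it as a failure.
exit 1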

Thaw Script

D:
CD D:\InterSystems\HS20152\mgr
..\bin\cache -s. -B -V -U%%SYS ##Class(Backup.General).ExternalThaw()

CD D:\InterSystems\T2017\mgr
..\bin\cache -s. -B -V -U%%SYS ##Class(Backup.General).ExternalThaw()

exit 0
Murray Oldfield · Sep 11, 2019 go to post

ahh, OK, I don't know why I got primary/alternate in my head... so the example is the standard message;

Journaling switched to:  ...

Murray Oldfield · Sep 11, 2019 go to post

Simpler to watch the cconsole.log?

I can't find an example but; test if the primary and alternate are on different paths?

06/23/18-19:37:30:760 (19971) 0 CACHE JOURNALING SYSTEM MESSAGE
Journaling switched to: /trak/site/live/jrnpri/MIRROR-TCMIRROR-20180623.010
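
A rough sketch of the "watch the cconsole.log" idea, assuming the switch message always has the form shown above and that primary and alternate journals live in different directories. The function names and the log path in the usage note are made up for illustration; this is not an InterSystems utility.

```shell
#!/bin/sh
# Print the journal file named in the most recent
# "Journaling switched to:" message of a console log.
latest_journal() {
    grep 'Journaling switched to:' "$1" | tail -n 1 | sed 's/^.*Journaling switched to: *//'
}

# Print the directory of that journal file. If primary and alternate
# are on different paths, this shows which one is currently in use.
current_journal_dir() {
    j=$(latest_journal "$1")
    if [ -n "$j" ]; then
        dirname "$j"
    fi
}
```

For example, `current_journal_dir /path/to/mgr/cconsole.log` would print the directory of the journal currently being written, which you can compare with your configured primary directory.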

Murray Oldfield · Sep 4, 2019 go to post

Ignoring whether there is a %SYSTEMJournal:IsPrimary() or some such (I simply don't know): if primary and alternate are on their own separate disk devices (dev/sdj/pri_journals) you will see writes on only one of them. Not very bulletproof, but it depends on what you are looking for.

Murray Oldfield · Jul 30, 2019 go to post

Hi, good question. The answer is the typical consultant answer... it depends. The temptation is to offer a "Best Practice" answer, but really, there are no best practices, just what's best for you or your customer's situation. If your storage performance is OK then keep monitoring, but you don't have to change anything. If you are having storage performance problems, or your capacity planning says you will need to scale and optimise, then you need to start looking at strategies available to you. Direct IO is one, but there are others. What you have prompted me to do is think about a community post bringing together storage options, especially now as we have moved into a time of all-flash SSD, NVMe, Optane..... So, I got this far without any answer at all...

A quick summary, because it will take a while to write a new post. Direct I/O is a feature of the file system whereby file reads and writes go directly from the application to the storage device, bypassing the operating system read and write caches. Direct I/O is used only by applications (such as databases) that manage their own caches. 

For Caché and IRIS, journals already use direct IO to ensure the journal really is persisted to disk, not held in a buffer.

InterSystems do recommend Direct I/O in some specific situations, for example for HCI on Linux, because we do need to optimise IO on these platforms (vSAN, Nutanix, etc). See the HCI Post. Direct I/O is enabled for reads AND writes with the [config] setting wduseasyncio=1. This also enables asynchronous writes for the write daemon. There can be situations where OS write cache is an advantage, like Caché online backup, which does a lot of sequential writes, or continuous database writes, like a database build with a lot of database expansions. So don't think Direct I/O is the answer to every situation. If you are using a modern backup technology like snapshots then you will be fine tho'.

MO

Murray Oldfield · Jul 26, 2019 go to post

On Linux use Asynchronous IO (asyncio) for RANREAD testing.
asyncio  enables direct IO for database reads and writes which bypasses file cache at the OS and LVM layers 
NOTE: Because direct IO bypasses filesystem cache, OS file copy operations including Caché Online Backup will be VERY slow when direct IO is configured.


Add the following to [config] section of the cache.cpf and restart Caché/HealthShare/IRIS: 
wduseasyncio=1
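
If you want to confirm the setting before a test run, something like the following could work. It is only a sketch: the function name is made up, it just looks for a `wduseasyncio=1` line anywhere in the file, and it does not verify that the line sits in the [config] section.

```shell
#!/bin/sh
# Succeed (exit 0) if the given CPF file contains a line setting wduseasyncio=1.
# Simplification: the [config] section itself is not checked.
asyncio_enabled() {
    grep -Eiq '^[[:space:]]*wduseasyncio[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1"
}
```

Usage, with an example path: `asyncio_enabled /path/to/cache.cpf && echo "direct IO configured"`. Remember the instance still needs a restart after the setting is changed.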
 


It might be helpful to have a 15-minute pButtons profile to run while RANREAD runs, so you can see operating system IO stats, e.g. iostat.
From the %SYS namespace (zn "%SYS"):

%SYS>set rc=$$addprofile^pButtons("15_minute","15 minute only", "1", "900")

%SYS>d ^pButtons
Current log directory: /trak/backup/benchout/pButtonsOut/
Available profiles:
     1  12hours     - 12 hour run sampling every 10 seconds
     2  15_minute   - 15 minute only
     3  24hours     - 24 hour run sampling every 10 seconds
     4  30mins      - 30 minute run sampling every 1 second
     5  4hours      - 4 hour run sampling every 5 seconds
     6  8hours      - 8 hour run sampling every 10 seconds
     7  test        - A 5 minute TEST run sampling every 30 seconds

select profile number to run:

Murray Oldfield · Jun 25, 2019 go to post

Hi all, I have been advised that the rtkaio library has been discontinued in the SUSE distribution since SUSE 9, so you cannot use the rtkaio library on SUSE. Specifically for SUSE, do NOT add the following to the cache.cpf file. Also note that the rtkaio library is not needed for IRIS -- only Caché.

LibPath=/lib64/rtkaio/

Murray Oldfield · Jun 12, 2019 go to post

Hi Rich, the short story is that I have fixed this and pushed to GitHub. A new container version will appear very soon.

The problem is caused by an unexpected date format in the "Profile" section of the pButtons HTML file: the year is in yy format instead of the expected yyyy.

As a workaround, you can open the HTML file in a text editor and edit the date manually to be the expected format. eg

Profile run "24hours_5" started at 00:00:00 on Jun 08 19.

change to:

Profile run "24hours_5" started at 00:00:00 on Jun 08 2019.
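
If you have several files to fix, the manual edit above could be scripted along these lines. This is a sketch only: the function name is made up, it assumes the Profile line looks exactly like the example, and it assumes the two-digit year belongs to the 2000s.

```shell
#!/bin/sh
# Read pButtons HTML on stdin and write it to stdout, expanding a
# two-digit year on the "Profile run ... started at" line to 20yy.
# Lines that already have a four-digit year are left unchanged.
fix_profile_year() {
    sed -E '/Profile run ".*" started at/ s/( on [A-Za-z]{3} [0-9]{2}) ([0-9]{2})\./\1 20\2./'
}
```

Usage, with example file names: `fix_profile_year < original.html > fixed.html`. Keep the original file until you have checked the result.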

Thanks for bringing this to my attention!

Murray Oldfield · Nov 14, 2018 go to post

Hi, there are several posts by Mark Bolinsky on cloud architectures including DR. Probably simplest to search for his name and browse through the articles for a good overview.

Murray Oldfield · May 15, 2018 go to post

For newer storage, especially all flash, 10,000 iterations will be too quick; change this to 100,000 for sustained read activity -- it should be less than a minute on SSD storage for each step. For example, using the above example;

for i in `seq 2 2 30`; do echo "do ##class(PerfTools.RanRead).Run(\"/db/RANREAD\",${i},100000)" | csession CACHEINSTNAME -U "%SYS"; done
Murray Oldfield · Mar 15, 2018 go to post

hmmm... it worked for me just now.... I did notice that when I exported from the markdown editor I used, it came across as "\<!--break--\>" with the angle brackets escaped by backslashes, but editing it to "<!--break-->" worked OK on the post about minimum monitoring and alerting solution. To be honest I had not tried until you mentioned it. haha yes, switching MD to WYSIWYG is a big mistake :(

Murray Oldfield · Nov 8, 2017 go to post

Hi, I think you have got it. Mark and I are saying the same thing. Sometimes a picture helps. The following is based on some presentations I have seen from Intel and VMware. It shows what happens when multiple VMs are sharing a host, which is why we recommend a reserved instance. But this also illustrates hyper-threading in general.

So to recap; A hyper-thread is not a complete core. A core is a core is a core, hyper-threading does not create more physical cores. The following picture shows a workload going through 4 processing units... micro execution units in a processor. 

One VM is green, a second is purple, and white is idle. A lot of time is spent idle waiting for memory cache, IO, etc.

As you can see with hyper-threading on the right, you don’t suddenly get two processors. It is not 2x; the expectation is a ~10-30% improvement in processing time overall.

The scheduler can schedule processing when there is idle time, but remember that on a system under heavy load CPU utilisation will be saturated anyway, so there is less white idle time.