Hi all,
Setting a startup routine like this can't lock down non-terminal-based Caché functionality (CSP, etc.), as the routine is run only for terminal-type logins; that said, I agree that LOGIN^%ZSTART should be sufficient in most cases. IMHO, this per-user startup-routine setting in the SMP comes from "the good old days" of terminal-oriented apps, which were often tied to terminals to prevent end users from entering MUMPS commands.
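For reference, a minimal sketch of the %ZSTART approach, assuming the documented entry points (the routine is compiled in the %SYS namespace; the ^ZLoginLog global and the body under LOGIN are purely illustrative):

%ZSTART ; user-written system routine, compiled in %SYS
        quit
LOGIN   ; executed at every login, regardless of connection type
        ; (terminal, CSP, etc.)
        ; illustrative only: note the login time per process
        set ^ZLoginLog($job)=$horolog
        quit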
=
Cheers,
Alex
Hi Michael,
As I recall, each SWD supports a dedicated kind of HDD device (they differ at the OS level).
Thank you, Ray.
As to our modest experience, WAN failure was the only type of disaster during two years of production life of the Regional HIS at the Krasnoyarsk Data Center, so its potential network isolation does not seem unrealistic.
We hope that the mentioned enhancement won't come too late.
Are there any plans to introduce in 2017.1 a Quick Old Primary Switching Back feature, which seems to be of great importance for the scenario of a temporary move of the Primary role to a promoted DR Async? It is known as Prodlog 142446. In a few words:
After the promotion of a DR Async without a partner check, the old Primary/Backup members' functionality would likely be restored only after a rebuild. Copying a ~1TB backup over a long-distance link can take many hours or even days, so it would be nice to track the point when the databases were last "in sync" (though I'm not sure this term applies in the case of a DR async). After that:
- discard the SETs/KILLs that could have been made on the old primary after this point
- demote (?) it to be a new DR async
- having the most recent journal on this new DR async, promote it to be the new primary (returning it to its "native" role).
Thank you, Bob, you mostly answered my questions. You wrote:
"you would compare the name and date of the most recent journal file from the async to the most recent journal file on the DR you are going to promote to see if you can get more recent journal data, which may not be the most recent."
In the meantime we are (internally) discussing the worst case of complete isolation of the main Data Centre, in which both members A and B may be unavailable. In this case the only thing we can do is check whether the DR Async we are going to promote has the most recent journal data among all other available Asyncs, right?
The version of Caché is 2015.1.4 for Windows x64, on a Core i5 based laptop. When I have a spare moment, I'll re-run this test on a more powerful server, though I don't expect a noticeable difference.
Let's try to estimate the time for both operations:
1: set max(args(i))=i ~ time_to_find_args(i)_position_in_max_array + time_to_allocate_memory_for_max(args(i)) ~ O(ln(length(args(i)))) + O(length(args(i)) + length(i)) ~ O(3*ln(length(args(i)))) (for whole numbers, length(x) grows like ln(x))
2: max<args(i) ~ time_to_compare_max_and_args(i) ~ O(ln(length(args(i))))
So it seems that 2 should be ~3 times quicker than 1, but we don't know the real coefficients hidden behind those O() estimates. I should confess that the local array node allocation penalty turned out to be higher than I expected.
This speed difference should be even greater if the args(i) values were strings rather than numbers.
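For the curious, here is a minimal sketch of how such a comparison can be timed; the method name, array size, and value range are illustrative, not the exact harness behind my published results:

ClassMethod TimeMaxVariants(n As %Integer = 1000000)
{
    // fill a local array with random whole numbers
    kill a
    for i=1:1:n set a(i)=$random(1000000)

    // variant 1: build a sorted array, then take its top subscript
    set t=$zhorolog
    kill s
    for i=1:1:n set s(a(i))=i
    set max1=$order(s(""),-1)
    write "variant 1: ",$zhorolog-t," s",!

    // variant 2: single pass with a running maximum
    set t=$zhorolog
    set max2=a(1)
    for i=2:1:n set:max2<a(i) max2=a(i)
    write "variant 2: ",$zhorolog-t," s",!
}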
Bob,
I have a couple of questions on DR Async.
1) There is an option of DR Promotion and Manual Failover with Journal Data from Journal Files
( http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=... )
where one is advised to get journal files from one of the failover members.
If both failover members are unavailable, is it possible to get journal files from another (DR or reporting) Async member?
If so, what preliminary configuration steps should one perform on that member to allow this option in case of disaster?
2) Another question is on journal file collection as well. You wrote:
"Asyncs receive journal data from the primary asynchronously, and as a result may sometimes be a few journal records behind"
Is it true that Asyncs pull journal data from the Primary? If so, the Primary is not aware of whether data was pulled by an Async or not. Therefore, if an Async has not been getting journal data from the Primary for several days (e.g. due to communication problems), the next unread journal file may already have been purged from the Primary's local storage.
Is it possible to recover from this situation without rebuilding the Async? E.g., if the purged journals are available as part of a file-system-level backup, or those files are kept on another Async server, can that help?
==
Thanks...
I've amended the test results due to some inaccuracies I found.
The new results surprised me, as I didn't expect scanning a local array to be about 10 times faster than creating a new one.
I agree, it's an elegant but not memory-efficient design: the sorted array is built only to pick up its top node. Of course, arg lists are rarely long and yet another small array doesn't cost much, but in the more generic case of searching for the max element in an array of sufficient size, the following code should be twice as memory efficient and maybe a bit faster:
ClassMethod max(args...) {
  // single pass: keep the running maximum and its position
  s max=$g(args(1)),im=1 for i=2:1:args s:max<$g(args(i)) max=args(i),im=i
  q $lb(max,im)
}
P.S. I was curious whether my variant was really faster and ran some tests. Here are the results, collected using an array filled with random whole numbers. The array was refilled on each test run.
In the table below:
min - the lower limit of the values in the array
max - the upper limit of the values in the array
n - the quantity of numbers in the array
var - variant # (1 - original, 2 - mine)
dt - average run time (in seconds)
dtfill - average time to fill the array; just for info.
Certainly yes: SQL querying of the Audit database (%SYS.Audit) won't help if Auditing is switched off.
Using a repository is a good idea for sure, but what about a solution that can help even if an 'intruder' has bypassed it and changed a class, e.g. on a production server? Here is one which answers who changed SomeClassName.CLS; this code can be executed in the "%SYS" namespace using System Management Portal/SQL:
SELECT DISTINCT TOP 1 s.UTCTimeStamp, s.OSUsername, s.Username, s.Description
FROM %SYS.Audit as s
WHERE s.Event='RoutineChange'
AND s.Description LIKE '%SomeClassName.cls%'
ORDER BY s.UTCTimeStamp desc
It's easy to adapt it to search for the same info for MAC, INT and INC routines.
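For instance, a hypothetical variant for an INT routine only needs a different LIKE pattern (SomeRoutineName is a placeholder):

SELECT DISTINCT TOP 1 s.UTCTimeStamp, s.OSUsername, s.Username, s.Description
FROM %SYS.Audit as s
WHERE s.Event='RoutineChange'
AND s.Description LIKE '%SomeRoutineName.int%'
ORDER BY s.UTCTimeStamp desc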
Enjoy!
Hi Anzelem,
May I ask you a couple of questions on your DR solution?
Which node would take over on Primary failure: the Caché Mirror Backup or the VCS secondary, if both are alive?
More generally: what is the main reason for mixing two different DR approaches?
=Thanks
Sometimes such strange results are caused by ignoring the fact that usually there are several levels of caching, from high to low:
- Caché global cache
- filesystem cache (on Linux/UNIX only, as the Windows version uses direct I/O)
- HDD controller cache.
So even restarting Caché may not be enough to drop the caches for clean "cold" testing. The tester should be aware of the data volume involved: it should be much larger than the HDD controller cache (at least). mgstat can help figure this out; besides, it can show when you start reading data mostly from the global cache rather than from the filesystem/HDD.
Hi Murray, thank you for continuing to write very useful articles.
ECP is rather complex stuff, and it seems well worth additional writing.
Just a quick comment on your point:
"For sustained throughput average write response time for journal sync must be: <=0.5 ms with maximum of <=1 ms."
How can one distinguish journal syncs from other journal writes looking at the iostat log only? It seems that the 0.5-1 ms limit should be applied to every journal write, not only to sync records.
And a couple of small questions. You wrote that
1) "...each SET or KILL the current journal buffer is written (or rewritten) to disk. "
and
2) "On very busy systems journal syncs can be bundled or deferred into multiple sync requests in a single sync operation."
Having mgstat logs for a (non-ECP) system, is it possible to predict the future journal sync rate after scaling horizontally to an ECP cluster? E.g., if we have average and peak mgstat Gloupds values, can we predict the future journal sync rate? And at what rate of journal syncs does their bundling/deferring begin?
You wrote: "However, the Newbie can ignore it all, by using Caché SQL". If so, how do you answer the curious Newbie's question: why should I use Caché at all, when a few SQL implementations are available for free nowadays?
Usually such questions were answered like this: Caché provides a Unified Data Architecture that allows several access methods to the same data (bla-bla-bla), and the quickest of them is Direct Global Access. If we answer this way, we should teach how to traverse globals, so you are doing the very right and useful thing!
There is only one IMHO: semantics can be harder to grasp than syntax. Whether one writes `while (1) { ... }` or `for { ... }`, it's basically all the same, while choosing $order or $query changes the traversal algorithm a lot, and it seems this stuff should be discussed in more detail.
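To illustrate, a small sketch over a hypothetical global ^Person: $order walks sibling subscripts at a single level, while $query visits every data-holding node of the tree in depth-first order:

 ; $order: iterate the first-level subscripts of ^Person
 set id=""
 for {
     set id=$order(^Person(id))
     quit:id=""
     write "id: ",id,!
 }
 ; $query: iterate every node that holds data, at any depth
 set node="^Person"
 for {
     set node=$query(@node)
     quit:node=""
     write node," = ",@node,!
 }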
Google answered that there are two options to replace '<' in PowerShell; please see https://blogs.technet.microsoft.com/heyscriptingguy/2011/07/16/working-around-legacy-redirection-issues-with-powershell/ .
Basic and advanced modes existed in an old version of another tool named ^Buttons. With ^pButtons you have the option to reduce the number of OS commands being performed, as was shown in Tip #4.
Good Morning, William!
To trace logon errors, you should look for events with Name = LoginFailure.
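For example, a hypothetical query in the same spirit as the %SYS.Audit query earlier in this list (run in the "%SYS" namespace; the TOP 10 is just illustrative):

SELECT TOP 10 s.UTCTimeStamp, s.OSUsername, s.Username, s.Description
FROM %SYS.Audit as s
WHERE s.Event='LoginFailure'
ORDER BY s.UTCTimeStamp desc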
It is hard to guess what kind of problem you have without looking at the Caché Security audit records for your logon attempts.
Checking .LCK files is useless in most cases, as the Caché service auto-starts with the OS. Of course, switching auto-start off is not a problem for a development/testing environment.
Frank touched on another interesting question: how long should WaitToKillServiceTimeout be? (It's a registry value under HKLM\SYSTEM\CurrentControlSet\Control, in milliseconds.) If we set it to ShutdownTimeout + Typical_Real_Shutdown_Time and Caché hangs during OS shutdown, I bet a typical Windows admin won't wait 5 minutes and will finish with a hardware reset... Choosing between bad and worse, I'd set
WaitToKillServiceTimeout = Typical_Real_Shutdown_Time
letting the OS force Caché down in the rare cases when it hangs.