What is a core file, and when is one useful?

Caché and InterSystems’ other products based upon Caché are very reliable. The vast majority of our customers never experience any kind of failure. However, under rare conditions Caché has failed, and in doing so may have produced a core file (called a dump file on Windows and OpenVMS). The core file contains a detailed copy of the state of the process at the instant of its failure, including the process’s registers and memory (including or excluding shared memory, depending upon configuration details).

The core file is in essence an instantaneous picture of a failing process at the moment it attempts to do something very wrong. From this picture we can extrapolate backwards in time in an effort to find the initial mistake that led to the failure. As we look back in time, our picture of the process becomes more and more fuzzy. With more detailed cores, we can look farther back in time before the picture becomes too fuzzy.

With properly collected core files and associated information, we can often solve the problem, or at least extract valuable information about the failing process. With an artificially induced core file, usually all we can say (often after hours of analysis) is “I see what happened to this process, someone artificially forced a core of the process.” An artificially induced core of a misbehaving but still-running process can be useful as a secondary source of information, to fill in details of an analysis based on information not available in the core.

Caché can be configured to record full cores on any process failure. This has no impact upon performance while you are running. Generally all you need is to keep a significant amount of disk space free for any potential, albeit unlikely, failure. InterSystems has a good record of solving problems when a full core is available. Of course, sometimes we discover it was an obscure and unavoidable hardware failure that is never going to occur again.

Caché can also be configured to record little or no information for a process failure. While there is no performance advantage to disabling cores, you might find an operational advantage. Cores can contain sensitive information. If you don’t want to have a policy for securing core files, you can enable core files only after repeated failures.

Out of the box, Caché is generally installed with an intermediate approach: limited-size cores, with which InterSystems can normally identify a previously solved problem and perhaps solve simple problems. We can’t solve all problems with the default limited cores.

The primary control for determining the size and type of core you will get is DumpStyle. There are a number of other Operating System specific controls.

DumpStyle is explained here: <http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RCPF_Dumpstyle>. DumpStyle takes an integer value from 0 to 4 that applies to every process in a Caché instance, and defines what kind of core (or dump) is saved should a Caché process encounter a serious error. The defined values are

| Code | Name | UNIX | OpenVMS | Windows |
|------|------|------|---------|---------|
| 0 | NORMAL | Produces full core (depending upon other settings) | Produces CACCVIO-pid.LOG (of limited value) | Produces pid.dmp (of limited value) |
| 1 | FULL | Produces full core (depending upon other settings) | Produces CACHE.DMP (possibly very large) | Produces cachefpid.dmp (possibly very large) |
| 2 | DEBUG | Prior to Caché 2014.1, produced core with shared memory omitted; now deprecated. Best to use OS-specific methods to omit shared memory. | Unimplemented | Reserved to InterSystems |
| 3 | INTERMEDIATE | Unimplemented | Unimplemented | Effective 2014.1, produces cacheipid.dmp |
| 4 | MINIMAL | Unimplemented | Unimplemented | Effective 2014.1, produces cachempid.dmp |

The default DumpStyle is 0 = NORMAL, except on Windows since Caché 2014.1, where it is 3 = INTERMEDIATE.

So, for this control, set it as follows:

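| Operating system | Limited cores | Intermediate cores | Full cores |
|------------------|---------------|--------------------|------------|
| UNIX | 0 | 0 | 1 (other controls apply) |
| OpenVMS | 0 | 0 | 1 |
| Windows | 4 | 3 | 1 |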
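The recommendations above can be captured as a small lookup. This is only an illustrative Python sketch: the dictionary and function names are hypothetical and are not part of any InterSystems tool, and the values are simply copied from the table.

```python
# Hypothetical helper; names are illustrative, values come from the table above.
DUMPSTYLE_FOR = {
    "unix":    {"limited": 0, "intermediate": 0, "full": 1},
    "openvms": {"limited": 0, "intermediate": 0, "full": 1},
    "windows": {"limited": 4, "intermediate": 3, "full": 1},
}


def recommended_dumpstyle(platform: str, level: str) -> int:
    """Return the DumpStyle value suggested for a platform and core level."""
    try:
        return DUMPSTYLE_FOR[platform.lower()][level.lower()]
    except KeyError:
        raise ValueError(
            f"unknown platform or level: {platform!r}, {level!r}"
        ) from None
```

For example, recommended_dumpstyle("Windows", "intermediate") gives the value to place after dumpstyle= on a Windows instance. On UNIX, remember that a DumpStyle of 1 only produces a full core if the operating system controls described later also allow it.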

There are three ways to change the value of DumpStyle. They are:

  1. Place this section in your cache.cpf file (you will need to use your operating system’s text editor for this):

    [Debug]
    dumpstyle=1

    The number after the equals sign is the new default DumpStyle. Restart Caché for the change to take effect. The new default then applies to all processes, unless you override it with method (2) or (3) below.

  2. Issue the command:

    SET old=$SYSTEM.Config.ModifyDumpStyle(1)

    The number in parentheses is the new value for DumpStyle. The old value is returned. This command is effective for all new processes created after it is run. Existing processes continue to run with their prior DumpStyle.

    This command became available with Caché 2014.1. For older versions, you can use this command:

    VIEW $ZUTIL(40,2,165):-2:4:1

    Where the new value for DumpStyle is the final digit.

  3. Issue this command, or place it in your application:

    VIEW $ZUTIL(40,1,48):-1:4:1

    Where the new value for DumpStyle is the final digit. This is effective only for the process issuing the command, and overrides methods (1) and (2).
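If you manage several instances, the text edit from method (1) can be scripted rather than done by hand. The following Python sketch is one possible approach, under the assumption that cache.cpf can be treated as plain INI-style text with [Section] headers and key=value lines; the function name set_dumpstyle is hypothetical. It operates on text only, so reading the file, backing it up, writing it back, and restarting Caché are left to you.

```python
def set_dumpstyle(cpf_text: str, value: int) -> str:
    """Return cpf_text with dumpstyle=<value> set in the [Debug] section.

    Assumption: the file is INI-style text. The [Debug] section is created
    at the end of the file if it does not already exist.
    """
    if value not in range(5):
        raise ValueError("DumpStyle must be an integer from 0 to 4")

    out = []
    in_debug = False   # currently inside the [Debug] section?
    written = False    # has a dumpstyle line been emitted yet?

    for line in cpf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [Debug] without having seen dumpstyle: add it now.
            if in_debug and not written:
                out.append(f"dumpstyle={value}")
                written = True
            in_debug = stripped.lower() == "[debug]"
        elif in_debug and "=" in stripped:
            key = stripped.split("=", 1)[0].strip().lower()
            if key == "dumpstyle":
                line = f"dumpstyle={value}"
                written = True
        out.append(line)

    if not written:
        if not in_debug:
            out.append("[Debug]")
        out.append(f"dumpstyle={value}")

    return "\n".join(out) + "\n"
```

Calling set_dumpstyle(text, 1) on the contents of a cache.cpf that has no [Debug] section appends the two lines shown in method (1); an existing dumpstyle line is rewritten in place.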

Operating System Specific Details

Most Operating Systems have their own controls to redirect cores to a common directory, and control the amount of information in cores. These too need to be set, and you should consider the ramifications for doing so, especially from a data privacy perspective.

Moving cores to a common directory is very useful in capacity planning, but may also make the cores more accessible to anyone wishing to exfiltrate data from your site.

There are many types of problems that simply cannot be solved without shared memory included in the core. Cores that include shared memory tend to be much larger than cores that do not; the difference is roughly the size of your global and routine buffers.

If you are processing sensitive information, a core file without shared memory will contain the information that process was processing, while a core file with shared memory will also contain all the global variables all processes recently accessed, where “recently” could mean minutes or considerably longer.

AIX

Full (and modern style) cores should be enabled with smit

System Environments
>  Change / Show Characteristics of Operating System
> >  Enable full CORE dump                               true
> >  Use pre-430 style CORE dump                         false

This can also be seen from the command line with:

# lsattr -E -l sys0 | egrep 'fullcore|pre430core'
fullcore     true            Enable full CORE dump                 True
pre430core   false           Use pre-430 style CORE dump           True

And set with:

# chdev -l sys0 -a fullcore=true -a pre430core=false -P

The -P makes the change permanent.

By default core files are written to the default directory of the process at the time of process failure. Typically that is the directory of one of your main CACHE.DAT files. This can be changed with smit:

Problem Determination
> Change/Show/Reset Core File Copying Directory

or from the command line with:

# chcore -p on -l /cores -n on -d

Ensure the file /etc/security/limits has a section with the line:

default:
core = -1

Finally, ensure that, by whatever means you set up environment variables for user processes, each user has CORE_NOSHM defined or not defined as desired.

CORE_NOSHM=1;export CORE_NOSHM    # sh in /etc/profile or $HOME/.profile
export CORE_NOSHM=1               # ksh in /etc/.kshrc or $HOME/.kshrc
export CORE_NOSHM=1               # bash in /etc/bashrc or ~/.bashrc
setenv CORE_NOSHM 1               # csh in ~/.cshrc

HP-UX

You can enable placing cores in a common directory with extended naming with

# coreadm -e global -g /cores/core.%p.%f

%p places the pid in the pathname, %f places the name of the executable (such as cache) in the pathname. See:

% man 1m coreadm

for more options.

Review if shared memory has been enabled in core files with:

# /usr/sbin/kctune core_addshmem_read
# /usr/sbin/kctune core_addshmem_write

Change with

# /usr/sbin/kctune core_addshmem_read=1
# /usr/sbin/kctune core_addshmem_write=1

1 means enable, 0 means disable. HP-UX divides shared memory into two types. In general Caché only uses write shared memory, but we recommend setting both types the same.

On HP-UX the core size is limited by the maxdsiz_64bit kernel parameter. Make sure it is set high enough that a full core can be generated.

Review with

# /usr/sbin/kctune maxdsiz_64bit

Set with

# /usr/sbin/kctune maxdsiz_64bit=4294967296

A user can further limit their core with a ulimit -c command. This command should be removed from /etc/profile, $HOME/.profile, and similar files for other shells unless it is your intention to limit core files.

LINUX

If you are running RHEL 6.0 or later (also CentOS), Red Hat has added their Automatic Bug Reporting Tool (ABRT). As installed this is not compatible with Caché. You need to decide if you wish to configure ABRT to support Caché, or disable ABRT. To make it compatible:

Determine the version of ABRT you are running:

# abrt-cli --version

Edit the ABRT configuration file. The name varies depending upon the version of ABRT:

ABRT 1.x:

 /etc/abrt/abrt.conf

ABRT 2.x:

 /etc/abrt/abrt-action-save-package-data.conf

If you installed Caché with a cinstall command (most common), find the ProcessUnpackaged= line, and change the value to yes

ProcessUnpackaged = yes

Otherwise, if you installed Caché from an RPM module, find the OpenGPGCheck= line, and change the value to no.

OpenGPGCheck = no

Regardless of how you installed Caché, find the BlackListedPaths= lines, and add a reference to cstat in the Caché install directory. If the BlackListedPaths= line does not exist, add it at the end with just the cstat reference.

BlackListedPaths=retain_existing_list,installation_directory/bin/cstat

Save your edits, and restart abrtd:

# service abrtd restart

Configured as such, Caché and ABRT will create a new directory (generally under /var/spool/abrt or /var/tmp/abrt) for each process failure, and in that directory, place the core, and associated information.

When a crash happens, issue the command

for ABRT 1.x:

 # abrt-cli --list

for ABRT 2.x:

 # abrt-cli list

This will show a list of recent process failures, and for each it will give a directory specification. In that directory will be a coredump file, along with a number of other small files that collectively can be quite useful in determining the cause of the process failure.

From some other directory, enter the command

% tar -cvzf wrcnumbercore.tar.gz /var/spool/abrt/directory/*

You can send us the compressed wrcnumbercore.tar.gz file.

Alternatively, you can disable ABRT with:

# service abrtd stop
# service abrt-ccpp stop    # ABRT 2.x only.

To permanently disable ABRT:

# chkconfig abrtd off
# chkconfig abrt-ccpp off    # ABRT 2.x only.

Finally, you need to update /proc/sys/kernel/core_pattern, as described below.

You can control where cores are deposited (unless you are using ABRT).

  • If you are using ABRT, you must skip this step.

  • If you have disabled ABRT, you must perform this step.

  • If you never had ABRT, this step is optional.

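Write the desired pattern to the file /proc/sys/kernel/core_pattern (for example, with echo as root). To make the setting persist across reboots, also set kernel.core_pattern in /etc/sysctl.conf.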

In the simple case, just use

core

It is generally useful to add the pid, and name of the program generating the core with

core.%p.%e

You might also place the cores in a common directory with:

/cores/core.%p.%e

See “man core” for more naming options.
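To preview what file name a given pattern will produce, the substitution can be modelled in a few lines. This Python sketch is a simplified illustration covering only %p, %e and %% as described above; the function name is hypothetical, and other specifiers documented in “man core” are left untouched rather than expanded.

```python
import re


def expand_core_pattern(pattern: str, pid: int, exe: str) -> str:
    """Expand %p (pid), %e (executable name) and %% (literal percent).

    Simplified model: any other %x specifier is returned unchanged,
    which is not what the kernel itself would do.
    """
    values = {"%": "%", "p": str(pid), "e": exe}
    return re.sub(
        r"%(.)",
        lambda m: values.get(m.group(1), m.group(0)),
        pattern,
    )
```

For instance, expand_core_pattern("/cores/core.%p.%e", 1234, "cache") shows where a failing cache process with pid 1234 would leave its core, which helps when checking that the directory exists and has enough free space.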

You should set /proc/self/coredump_filter to control the amount of memory dumped to the core. This can be done in an appropriate /etc/profile.d/something.sh file. The command is

# echo 0x33 >/proc/self/coredump_filter

The exact bitmap used depends upon the level of data you wish to collect. The meanings of the bits can be found in “man core”. The bits that make sense for Caché are:

| Bit | Description | Need for Caché |
|------|-------------|----------------|
| 0x01 | Anonymous private mappings | Always needed |
| 0x02 | Anonymous shared mappings | Needed for complex problems |
| 0x04 | File-backed private mappings | Maybe needed for problems with $ZF() |
| 0x08 | File-backed shared mappings | Maybe needed for problems with $ZF() |
| 0x10 | Dump ELF headers | Always needed |
| 0x20 | Dump private huge pages | Not currently used by Caché |
| 0x40 | Dump shared huge pages | Not currently used by Caché |
| 0x80 | Not currently defined | |
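Because the filter is a bitmap, the value to echo is the bitwise OR of the rows you want. The Python sketch below shows one way to compose and decode such a value; the symbolic names are invented for this example (the kernel only knows the numeric bits), and the bit values are the ones listed in the table.

```python
# Symbolic names are hypothetical; the bit values come from the table above.
COREDUMP_FILTER_BITS = {
    "anon_private": 0x01,
    "anon_shared": 0x02,
    "file_private": 0x04,
    "file_shared": 0x08,
    "elf_headers": 0x10,
    "hugetlb_private": 0x20,
    "hugetlb_shared": 0x40,
}


def build_filter(*names: str) -> int:
    """OR together the bits for the named memory types."""
    mask = 0
    for name in names:
        mask |= COREDUMP_FILTER_BITS[name]
    return mask


def describe_filter(mask: int) -> list:
    """List the memory types that a filter value selects."""
    return [name for name, bit in COREDUMP_FILTER_BITS.items() if mask & bit]
```

Decoding the 0x33 used in the command above with describe_filter(0x33) shows that it selects anonymous private and shared mappings, ELF headers, and private huge pages. A smaller core without shared memory would use build_filter("anon_private", "elf_headers"), which is 0x11, at the cost of making complex problems harder to solve.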

You should set your ulimit -c for all processes to unlimited. This can be placed in an appropriate /etc/profile.d/something.sh file.

ulimit -c unlimited
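A process can also inspect the core size limit it actually inherited, which is a quick way to confirm that the profile script took effect. The following Python sketch uses the standard resource module (UNIX only); the function names are hypothetical. It reports on, or changes, only the process that runs it and that process's children, so it is a diagnostic aid and not a substitute for the ulimit line above.

```python
import resource


def core_limit_is_unlimited() -> bool:
    """Return True if this process's soft core-size limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft == resource.RLIM_INFINITY


def raise_core_limit() -> None:
    """Raise the soft core-size limit as far as the hard limit allows."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
```

If core_limit_is_unlimited() returns False even after raise_core_limit(), the hard limit itself is restricted, and it has to be lifted by an administrator before full cores can be written.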

macOS (OS X, Darwin)

For versions OS X 10.4 (Tiger) to OS X 10.9 (Mavericks), edit the file /etc/launchd.conf, and add the line

limit core unlimited

And reboot.

For versions OS X 10.10 (Yosemite) and newer, /etc/launchd.conf is no longer supported, and core file generation is not disabled at the system level by default. However, it can be disabled by editing /etc/sysctl.conf and inserting this line:

kern.coredump=0

It can be re-enabled, by removing the line, or changing the value to 1.

It may still be necessary to increase your ulimit for each user. This may be placed in any of /etc/profile, $HOME/.profile, /etc/bashrc, or ~/.bashrc.

ulimit -c unlimited

OpenVMS

By default Caché will only produce CACCVIO-pid.LOG files for failing processes. With these, only relatively simple problems can be solved. These CACCVIO-pid.LOG files will always be placed in the process’s default directory (typically the directory of a CACHE.DAT file), and can only be redirected by changing the process’s default directory.

If extended process dumps (FULL dumps) are enabled, they too will be placed in the process default directory. However they can be redirected, by defining the logical name SYS$PROCDMP to point to a directory in which to store the process dump. This logical name can be defined at the /SYSTEM level. The file name will be CACHE.DMP or CSESSION.DMP.

OpenVMS also provides the logical name SYS$PROTECTED_PROCDMP. You should also define that logical name with both /EXECUTIVE_MODE and /SYSTEM. This applies to process failures of privileged images, and parts of Caché are privileged. The OpenVMS documentation will advise you to define the two logical names to different directories, and to place higher security on the directory corresponding to SYS$PROTECTED_PROCDMP. This is based upon the assumption that the data processed by privileged images is more sensitive than that processed by non-privileged images. If both are sensitive, it is OK to point both logical names to the same directory.

Solaris

You can enable placing cores in a common directory with extended naming with

# coreadm -e global -g /cores/core.%p.%f -G all

%p places the pid in the pathname, %f places the name of the executable (such as cache) in the pathname.

The -G all includes all types of memory, that is a full core. Omit this for a default core that still includes most shared memory. The following things can be stored in the core:

| Code | Caché usage | In default |
|------|-------------|------------|
| stack | Needed | yes |
| heap | Needed | yes |
| shm | Not used | yes |
| ism | Not used | yes |
| dism | Caché shared memory | yes |
| text | Useful for $ZF() failures | yes |
| data | Needed | yes |
| rodata | Not used | yes |
| anon | Needed | yes |
| shanon | Generally small | yes |
| ctf | Needed | yes |
| symtab | Useful for $ZF() failures | no |
| shfile | Not used | no |

The keyword all includes all types of memory; default includes all but the last two. If you want significantly smaller cores (to save space at the expense of making fewer problems solvable), the most space is saved by removing dism shared memory. Do this with

# coreadm -e global -g /cores/core.%p.%f -G default-dism

See:

% man 1m coreadm

for more options.
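A content specification for -G is a base keyword (such as all or default) with individual types added using + or removed using -. If you generate these commands from a provisioning script, a small helper keeps the pieces readable. This Python sketch only builds the argument list and does not run coreadm; both function names are hypothetical.

```python
def content_spec(base: str = "default", add=(), remove=()) -> str:
    """Build a coreadm content string such as 'default-dism'."""
    spec = base
    for token in add:
        spec += "+" + token
    for token in remove:
        spec += "-" + token
    return spec


def coreadm_command(pattern: str, spec: str) -> list:
    """Return the coreadm argument list for a global core pattern and content."""
    return ["coreadm", "-e", "global", "-g", pattern, "-G", spec]
```

For example, coreadm_command("/cores/core.%p.%f", content_spec(remove=["dism"])) yields the arguments for the smaller cores described above. Passing the result as a list to a process launcher avoids any shell quoting concerns around the specification.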

By default users have

% ulimit -c unlimited

You may use the ulimit command (or the limit command in csh) to disable cores, but coreadm is generally more flexible. So you should ensure ulimit commands don’t appear in /etc/profile or $HOME/.profile, or corresponding files for other shells.

Windows

The information to be included in a dump file for Windows is fully controlled by the DumpStyle parameter in the cache.cpf file (or by one of the other interfaces for changing DumpStyle described above).

Testing

Local security setup among other problems can prevent a core from actually being written. It can be very useful to test if a core will actually be created under real-world conditions. To do that, enter the command:

DO $ZUTIL(150,"DebugException")

To be certain, you should test this statement interactively, inside JOBs (assuming your application uses the JOB command), and even hidden inside an option of your application that your users will not accidentally select. Verify that you get a core file, and follow the sanity check in the next section to verify that it is a good core file.
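On UNIX systems, checking that the test really left a core behind can be automated. The Python sketch below lists files in a directory whose names begin with core and that were modified at or after a given time. The name prefix is an assumption that matches the core_pattern and coreadm examples in this article; adjust it if you chose a different naming scheme. The function name is hypothetical.

```python
from pathlib import Path


def find_new_cores(directory, since: float) -> list:
    """Return files named core* in directory modified at or after `since`.

    `since` is a POSIX timestamp, for example time.time() recorded just
    before the test was triggered.
    """
    found = []
    for path in Path(directory).iterdir():
        if (
            path.is_file()
            and path.name.startswith("core")
            and path.stat().st_mtime >= since
        ):
            found.append(path)
    return sorted(found)
```

Record the time, trigger the test failure, then call find_new_cores with your core directory and the recorded time. An empty result means no core was written, which usually points at directory permissions, a core size limit, or one of the operating system settings described above.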

Responding to a core

Should you be so unfortunate as to experience a process failure, we are sorry, and we will do our best to get to the bottom of it. These are three things you should do:

  1. Sanity Test. If you have a core or other evidence of a process failure, you may have evidence of a new problem. You may also have evidence of a previously discovered problem, or evidence that your core collection itself is broken. We can investigate the latter two cases with a simple look at your core file that is best performed on the machine that generated the core. Based upon your operating system, do:

    Operating System

    Sanity test

    AIX

    # dbx cache core
    (dbx) set $stack_details
    (dbx) where
    (dbx) quit

    Send us the output from the above commands when opening a problem with the WRC. If you do not have dbx installed on your system, just open a new problem.

    HP-UX

    # gdb cache core
    (gdb) frame 0
    (gdb) while 1
     > info frame
     > up
     > end
    (gdb) quit

    # adb cache core
    adb> $c
    adb> $q

    Send us the output from one of the two command sets above depending upon which debugger you have available. If you have both, gdb (actually Wildebeest) is preferred.

    LINUX

    # gdb cache core
    (gdb) frame 0
    (gdb) while 1
     > info frame
     > up
     > end
    (gdb) quit

    Send us the output from the above commands when opening a problem with the WRC. If you do not have gdb installed on your system, just open a new problem.

    macOS (OS X, Darwin)

    # lldb
    (lldb) target create -c core
    (lldb) thread backtrace all
    (lldb) quit

    # gdb cache core
    (gdb) frame 0
    (gdb) while 1
     > info frame
     > up
     > end
    (gdb) quit

    Send us the output from lldb with macOS since version 10.8. For macOS (OS X) version 10.7 and prior, send the gdb output.

    OpenVMS

    $ ANALYZE/CRASH dumpfile.dmp
    SDA> SHOW CALL_FRAME/ALL

    If you are still running OpenVMS v7.x (or earlier), the previous command will not work. Instead use:
    SDA> SHOW CALL_FRAME
    SDA> SHOW CALL_FRAME/NEXT

    (Repeat the prior command until you get an error.)
    SDA> QUIT

    $ ANALYZE/PROCESS dumpfile.DMP
    DBG> SHOW CALL/IMAGE
    DBG> QUIT

    Send us the output from either SDA or the debugger, but the output from SDA is preferred. If you only have a CACCVIO-pid.LOG file, check that it is not empty or almost empty.

    Solaris

    # mdb cache core
    > ::stackregs
    > ::quit

    # dbx cache core
    (dbx) where
    (dbx) quit

    For almost every application, dbx is the preferred debugger on Solaris, but for a sanity test, mdb is better. Send us the stack trace produced by mdb or dbx (mdb preferred) when you open a problem report with the WRC.

    Windows

    Currently there is no recommended sanity check for Windows process dumps.

  2. Open a problem with the WRC. A process failure should never be regarded as a mere nuisance. Caché should never produce a core (or dump) file. This is a sign of a problem that we would like to investigate.

    Call

     +1–617–621–0700, or

     +44 (0) 844 854 2917 in the UK

     0800 888 22 11 in Brazil

     +49 (0) 6151-17 47-47 in Germany

    e-mail the details of the sanity test to <support@intersystems.com>.

  3. Be prepared to send us the full core along with support files that may be needed for your particular operating system. It is important to remember that some of the files we want are binary files, while others are text. For most file transfer methods (especially between unlike operating systems), it is important to specify if the file is binary or text to prevent the file from being corrupted. It is also important to consider the level of security needed to guard these core files. We will provide you with an account on our sftp upload server. The core will be encrypted on the wire, and we will limit the machines inside InterSystems onto which the file is copied, and who has access. If you desire tighter access control, please let us know.

    Operating System

    What to send

    RHEL Linux with ABRT enabled

    The contents of the /var/spool/abrt/directory/ compressed into the file wrcnumbercore.tar.gz (binary) as explained above.

    UNIX (all other flavors)

    Issue the command

    % sudo ldd install_directory/bin/cache

    Place the core file, the cache executable itself, and all the library files listed by ldd into a .tar.gz file, and send us that .tar.gz file (binary). Please also report the specific version of UNIX involved (uname -a). See “Do I really need to send my libraries?” below.

    OpenVMS

    Please send the dump file (CACHE.DMP or CSESSION.DMP), along with the full $ZVERSION string, and the version of OpenVMS you are using (SHOW SYSTEM/NOPROC). If all you have is a CACCVIO-pid.LOG file, please send that.

    Windows

    Please send the dump file (pid.dmp, cachempid.dmp, cacheipid.dmp, or cachefpid.dmp). If all you have is the mini-dump contained in the cconsole.log file, please send the cconsole.log from before the failure until after the failure. If you have relinked Caché (to add $ZF() calls), or if you are not sure whether you have relinked Caché, send the cache.exe file from the \bin directory. If you don’t send the cache.exe, send the full $ZVERSION string. Please also report the full version of Windows you are using.
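For the UNIX case, collecting the libraries named by ldd is tedious by hand. The Python sketch below is one way to do it, assuming the Linux style of ldd output (name => /path (address)); other UNIX flavors format ldd output differently and would need a different parser. The function names and the archive name are hypothetical.

```python
import tarfile


def parse_ldd_output(text: str) -> list:
    """Extract absolute library paths from Linux-style ldd output."""
    paths = []
    for line in text.splitlines():
        fields = line.split()
        if "=>" in fields:
            # Form: libc.so.6 => /lib64/libc.so.6 (0x...)
            idx = fields.index("=>")
            if idx + 1 < len(fields) and fields[idx + 1].startswith("/"):
                paths.append(fields[idx + 1])
        elif fields and fields[0].startswith("/"):
            # Form: /lib64/ld-linux-x86-64.so.2 (0x...)
            paths.append(fields[0])
    return paths


def bundle(archive_name: str, files) -> None:
    """Write files into a gzip-compressed tar archive.

    dereference=True stores the library contents themselves rather than
    the symbolic links through which libraries are usually reached.
    """
    with tarfile.open(archive_name, "w:gz", dereference=True) as tar:
        for name in files:
            tar.add(name)
```

In use, you would capture the output of the ldd command shown above, pass it to parse_ldd_output, and then call bundle with the core file, the cache executable and the returned library paths. Remember that the resulting archive is a binary file and contains whatever sensitive data is in the core.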

Do I really need to send my libraries? It is best to always send the libraries for UNIX systems, but how important they are for a quick and correct analysis of the core depends upon a number of factors. The two most important are whether the libraries were involved in the process failure (many memory access failures occur inside “C” memory move and compare functions), and how easy it is to parse the run-time stack used by the supported platform.

For example, AIX (on PowerPC) and Solaris (on SPARC) can be reliably parsed, so libraries are only needed in some cases. Other current platforms can often be parsed using heuristics, but that can take considerably longer. In general we don’t have heuristics for older, best-effort-only platforms like Tru64 UNIX and SCO UNIX, so in those cases libraries are essential.

Do I really need to send my cache executable? We maintain copies of all software we release, so in theory no. However, on all platforms but OpenVMS we allow our customers to rebuild the cache executable to include additional functions available by $ZF() calls. We don’t have a copy of a rebuilt executable, and can’t analyse your core without it. Also, the core does not always record the exact version of Caché that produced it, so pulling the exact version from our library is not trivial.
