As noted in the documentation, ExternalFreeze() cannot last longer than 10 minutes, or however long it takes you to run out of global buffers.

You could use irisstat:
iris stat [INSTANCE] | grep WDSUSPD # this will show you whether the write daemons are suspended.
iris stat [INSTANCE] -W # this will resume them.

As a guess, you are running out of buffers. The options are to move the snapshot to a less busy time on the system and/or to increase the number of global buffers.
Look at using mgstat to find a better time to do the backup.
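
To make the "find a better time" idea concrete, here is a minimal sketch of how collected samples could be reduced to a quietest hour. It assumes you have already extracted (time, global updates per second) pairs from your mgstat output; the function name and the input shape are my own for illustration and are not part of mgstat.

```python
from collections import defaultdict


def quietest_hour(samples):
    """Return the hour (0-23) with the lowest average write activity.

    samples: iterable of ("HH:MM:SS", updates_per_second) pairs.
    This input shape is an assumption; adapt it to however you
    export the time and update-rate columns from your samples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for time_string, updates in samples:
        hour = int(time_string.split(":")[0])
        totals[hour] += updates
        counts[hour] += 1
    if not counts:
        raise ValueError("no samples given")
    # Pick the hour whose average update rate is smallest.
    return min(counts, key=lambda h: totals[h] / counts[h])
```

Feed it a few days of samples rather than one, since a single day can easily be unrepresentative.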

For a 100% certain rollback, I think a restore as @Ben Spead suggested would be the way to go. If you were going back to IRIS 2024.1.1, you could maybe just rename the install directory and install the old version, but I would not count on that working. I think things like journal files and IRIS.DAT get upgraded, so you cannot go back. You could have a mirror with members on the old and new versions and move from the old to the new, but you would still lose data since you cannot mirror to a down-level version (see InterSystems IRIS Instance Compatibility).
 

I would go to the WRC and ask them. Contact the WRC

Some thoughts:
Install it outside the rootvg so not in /usr or /opt
Don't bury it deep in a path. Something like /[application]/[instance]/irissys for example
Data in a separate volume group to enable snapshots and the external freeze.
Journals in a separate volume group to enable snapshots and the external freeze.
Keep instance names unique unless the instance is a failover member or DR async.

Look at Storage Planning
 

For IP address transfer/takeover to work the network has to be the same. Other clustering solutions work like this.

Planning a Mirror Virtual IP (VIP)

If by fake you mean it is not the normal OS-level clustering solution, that is true. It is an app-specific solution.

If the cost of the 2x storage is an issue, maybe a deduplicating SAN would reduce the cost, at the added risk of not having redundant storage.

If you used LVM and XFS at the OS layer, you could just add a LUN to the volume group and grow the filesystems. This can be done with everything up.
For me, the backup solution determines what a large file is: anything over 1 terabyte. We use External Backups.
Most backup solutions only have one process per file. This means fewer and larger IRIS.DATs will always be slower to back up and restore than more and smaller IRIS.DATs.
The growth pattern needs to be understood. If it is going to just grow forever, it has to be broken apart, and that will be easier while it is small.
In your place I would upgrade to the 2024 version and explore Multivolume Databases.
8K IRIS.DATs have a max size of 32 TB, though InterSystems is working on this.
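
To illustrate the point about one process per file, here is a rough sketch of the arithmetic. It assumes each file is copied by exactly one stream and that every stream runs at the same fixed throughput; the function name, the stream count and the throughput figure are made-up inputs, not numbers from any particular backup product.

```python
import heapq


def backup_hours(file_sizes_tb, streams, mb_per_sec_per_stream):
    """Estimate wall-clock hours when each file is handled by one stream.

    Files are handed out largest first to whichever stream frees up
    earliest. This is a simple greedy model, not how any specific
    backup tool schedules its work.
    """
    if streams < 1:
        raise ValueError("need at least one stream")
    finish_times = [0.0] * streams  # seconds at which each stream is free
    heapq.heapify(finish_times)
    for size_tb in sorted(file_sizes_tb, reverse=True):
        seconds = size_tb * 1024 * 1024 / mb_per_sec_per_stream
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + seconds)
    return max(finish_times) / 3600
```

With 8 streams at a hypothetical 200 MB/s each, `backup_hours([8], 8, 200)` comes out eight times longer than `backup_hours([1] * 8, 8, 200)`, because the single 8 TB file keeps one stream busy while the other seven sit idle.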
 

A low-impact way to do this would be to take a SAN snapshot of production and mount the snapshot in test.

InterSystems discusses this in External Backup.

The easy way to retain some databases is to mount them from a different set of disks/filesystems.

If you have a DR mirror but not a SAN, you could just shut down the DR mirror and do a cold backup.

If test is a separate mirror the whole test mirror will have to be refreshed due to mirror headers. If test is part of the same mirror as production then the data is already there.

In two really key ways, multi-volume databases don't allow you to escape limits you might want to escape.
The max size of a multi-volume database is the same as a single file: about 32 TB (33553904 MB) for an 8K database. InterSystems is working to increase the max size.
An integrity check also treats it as if it were a single file. This is only going to be more of an issue as the max size increases.
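
As a quick sanity check on the size limit, here is the unit conversion for the 33553904 MB figure. The sketch assumes "MB" here means 2^20 bytes; the constant and function names are mine.

```python
MAX_8K_DB_MB = 33553904  # limit quoted above, assumed to be in units of 2**20 bytes


def mb_to_binary_tb(megabytes):
    """Convert MB (2**20 bytes) to TB counted in binary units (2**40 bytes)."""
    return megabytes / (1024 * 1024)


def mb_to_decimal_tb(megabytes):
    """Convert MB (2**20 bytes) to TB counted in decimal units (10**12 bytes)."""
    return megabytes * 1024 * 1024 / 1e12
```

In binary units the limit is just under 32 TB; counted in decimal units, which is how storage vendors usually quote capacity, the same number of bytes is a little over 35 TB.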