Discussion
· May 17, 2021

Should we store external files in InterSystems %Stream or Windows folders?

We've been tasked with developing a file upload module as part of our wider system, storing scanned documents against a patient's profile. Our InterSystems manager suggested that storing those files in the DB as streams would be the best approach, and it sounded like a solid idea: the data can be encrypted, complex indexes are possible, the engine is optimized for large files, and so on. However, the stakeholder questioned why we would want to do that over storing them in Windows folders, and said that putting them in the DB was nuts. So we were wondering what everyone else has done in this situation and what made them take that route.

Where do you primarily store your users' uploaded files within your application?

The nice advantage of storing them in the DB is that it makes the following easier:

- refreshing earlier environments for testing
- mirroring the file contents
- encryption
- simpler consistent backups

However, if you're talking about hundreds of GBs of data, then you can run into issues which you should weigh against the above:

- journaling volume
- .dat size
- .dat restore time

One way to help mitigate the above for larger-volume file management is to map the classes that store the stream properties into their own .DAT so they can be managed separately from other application data; you can even use subscript-level mapping to cap the size of the file .DATs. For example:
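A minimal sketch of that idea, with hypothetical class and global names (the LOCATION parameter directs stream data into a global of your choosing, which you can then map to a dedicated database):

```
/// Hypothetical document class. The LOCATION parameter sends stream data
/// to ^App.DocStreamD instead of the default extent stream global, so that
/// global can be mapped to its own .DAT and managed separately.
Class App.Document Extends %Persistent
{
Property PatientId As %String [ Required ];
Property FileName As %String(MAXLEN = 255);
Property Contents As %Stream.GlobalBinary(LOCATION = "^App.DocStreamD");
Index PatientIdx On PatientId;
}
```

From there, a global mapping for App.DocStreamD (created in the Management Portal or via the Config.MapGlobals API in %SYS) points that global at the dedicated database, and subscript-level mappings on the same global can split it across several .DATs to cap their individual sizes.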

Hope that helps

I can't disagree with Ben: there is a cut-off point where it makes more sense to store the files external to IRIS. It should be noted that if I were working with any other database technology, such as Oracle or SQL Server, I wouldn't even consider storing 'blobs' in the database. However, Caché/Ensemble/IRIS is extremely efficient at storing stream data, especially binary streams.

I agree with Ben that by storing the files in the database you will have the benefits of journaling and backups, which support 24/7 uptime. If you are using mirroring as part of your disaster recovery strategy, then restoring your system will be faster.

If you store the files externally, you will need to back up the files as a separate process from your Caché/Ensemble/IRIS backups. I assume that you would have a separate file server, as you wouldn't want to keep the external files on the same server as your Caché/Ensemble/IRIS server, for several reasons:

1) You would not want the files to be stored on the same disk as your database .dat files, as the disk I/O might be compromised.

2) If your database server crashes you may lose the external files unless they are on a separate server.

3) You would have to back up your file server to another server or suitable media.

4) If the stream data is stored in IRIS then you can use iFind and iKnow on the file content, which leads you into the realms of ML, NLP and AI (see the sketch after this list).

5) If your CACHE.DAT files and the external files are stored on the same disk system, you can potentially run into disk fragmentation issues over time, and the system will get slower as the fragmentation gets worse. It is far better to have your CACHE.DAT files on a disk system of their own with the database growth factor set quite high: database growth will then be contiguous, fragmentation is considerably reduced, and the stream data will be managed as effectively as any other global structure in Caché/Ensemble/IRIS.
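To illustrate point 4, here is a rough sketch of an iFind index over stream content (the class and index names are hypothetical, iFind requires an appropriate license, and it indexes text, so a scanned image would first need OCR into a character stream):

```
Class App.ScannedDoc Extends %Persistent
{
/// Extracted text of the scanned document (not the raw image)
Property Contents As %Stream.GlobalCharacter;
/// Full-text index over the stream content
Index ContentFT On (Contents) As %iFind.Index.Basic;
}
```

That could then be queried with SQL along the lines of SELECT %ID FROM App.ScannedDoc WHERE %ID %FIND search_index(ContentFT, 'diabetes').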

Yours

Nigel

Fragmentation is not much of an issue anymore with SSD disks.

But in any case, I agree with storing files in the database. I have a system in production with about 100TB of data, where more than half is just files stored in the database. Some of our .dat files are, through mappings, used exclusively for streams, and we take care of them periodically by cutting them off at some point to continue with an empty database. Mirroring helps us not worry too much about backups. But if we had to store that amount of files as files on the filesystem, we would lose our minds caring about backups and integrity.
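A heavily hedged sketch of that "cut and continue" idea, assuming the stream global (App.DocStreamD, the hypothetical global from the earlier sketch) is already covered by its own global mapping. All names and paths are assumptions, the exact Config API calls can differ by version, and repointing the mapping makes the old streams invisible to the namespace, so they must only be archive data at that point. Run in %SYS:

```
// Create a fresh, empty database for new stream data (path is hypothetical)
set sc = ##class(SYS.Database).CreateDatabase("/data/streams-next/")
set props("Directory") = "/data/streams-next/"
set sc = ##class(Config.Databases).Create("STREAMSNEXT", .props)

// Repoint the existing stream-global mapping at the new database;
// the old .DAT is kept aside purely as an archive from here on.
set map("Database") = "STREAMSNEXT"
set sc = ##class(Config.MapGlobals).Modify("APP", "App.DocStreamD", .map)
```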

I'm throwing in another vote for streams for all the reasons in the above reply chain, plus two more:

1. More efficient hard drive usage. If you have a ton of tiny files and your hard drive is formatted with a larger allocation unit, you're going to use a lot of space very inefficiently and very quickly.

2. At my previous job, we got hit by ransomware years ago that encrypted every document on our network. (Fortunately, we had a small amount of data and a good offline backup process, so we were able to recover fairly quickly!) We were also using a document management solution that ran on Cache and stored the files as Stream objects, and they were left untouched. I'm obviously not going to say streams are ransomware-proof, but that extra layer of security can't hurt!

It's not a good idea to store files in the DB that you'll simply be reading back in full. The main issue you'll suffer from if you do hold them in the database (which nobody else seems to have picked up on)  is that you'll needlessly flush/replace global buffers every time you read them back (the bigger the files, the worse this will be).  Global buffers are one of the keys to performance.  

Save the files as files and use the database to store their filepaths as data and indices, for example:
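A minimal sketch of that pattern (the class and property names are hypothetical):

```
/// Only metadata and the external path live in the database;
/// the bytes stay on the filesystem.
Class App.DocumentRef Extends %Persistent
{
Property PatientId As %String [ Required ];
Property FilePath As %String(MAXLEN = 1024) [ Required ];
Property SizeBytes As %Integer;
Index PatientIdx On PatientId;
Index PathIdx On FilePath [ Unique ];
}
```

If you want the rest of the code to stay stream-agnostic, %Stream.FileBinary can wrap such an external file behind the same stream interface as %Stream.GlobalBinary.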

Hey Matthew,

No technical suggestions from me, but I would say that there are pros/cons to file vs. global streams which have been covered quite well by the other commenters. For the performance concern in particular, it is difficult to compare different environments and use patterns. It might be helpful to test both file and global streams and see how the performance of your expected stream usage, combined with your system activity, plays into your decision to go with one or the other; a rough harness could look like the sketch below.
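A rough timing sketch, assuming a utility class to put it in; the counts, sizes, and the /tmp path are placeholders to adjust to your real document profile:

```
/// Writes `files` streams of roughly `mb` MB each, first as global
/// streams and then as file streams, printing the elapsed seconds.
ClassMethod TimeStreams(files As %Integer = 50, mb As %Integer = 5)
{
    set chunk = $translate($justify("", 32000), " ", "x")  // ~32KB payload
    set chunks = (mb * 1048576) \ $length(chunk)

    set t = $zhorolog
    for i = 1:1:files {
        set s = ##class(%Stream.GlobalBinary).%New()
        for j = 1:1:chunks { do s.Write(chunk) }
        do s.%Save()
    }
    write "global streams: ", $zhorolog - t, "s", !

    set t = $zhorolog
    for i = 1:1:files {
        set s = ##class(%Stream.FileBinary).%New()
        set s.Filename = "/tmp/bench_" _ i _ ".bin"  // placeholder path
        for j = 1:1:chunks { do s.Write(chunk) }
        do s.%Save()
    }
    write "file streams: ", $zhorolog - t, "s", !
}
```

Reading the streams back under a realistic concurrent load matters just as much as the write times, so a read pass over the saved streams would be a sensible addition.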

I've mentioned above a system with a significant amount of streams stored in the database. I just checked how the global buffers are used there, and streams occupy only around 6%. The system is very active, including the files: tons of objects are created every minute, files get attached, and files change (yeah, our users can change MS Word files online on the fly, and we keep all the versions).

So, I still see no reason to change it, and I still see tons of benefits in keeping it as is.

I'm not sure what you mean by this. On an IRIS instance configured with global buffers of different sizes, the different-sized buffers are organized into separate pools. Each database is assigned to a pool based on the smallest size available that can handle that database. If a system is configured with 8KB and 32KB buffers, the 32KB buffers could be assigned to handle 16KB or 32KB databases but never 8KB databases.

Jeffrey, thanks. But suppose I had only a 16KB buffer pool configured, with a mix of 8KB databases (mostly system, or CACHETEMP/IRISTEMP) and some of my application data stored in 16KB blocks. The 8KB databases would still be buffered in the 16KB pool, stored one to one: 8KB of data in a 16KB buffer. Is that correct?

So, if I want to separate the global buffers used for streams, I just need a block size for them distinct from any other data, plus a deliberately small buffer pool for that block size, and that would be enough for more efficient usage of the global buffers, at least giving non-stream data a higher priority?

Yes, if you have only 16KB buffers configured and both 8KB and 16KB databases, then the 16KB buffers will be used to hold 8KB blocks - one 8KB block stored in one 16KB buffer using only 1/2 the space...

If you allocate both 8KB and 16KB buffers then (for better or worse) you get to control the buffer allocation between the 8KB and 16KB databases. 

I'm just suggesting that this is an alternative to standing up a 2nd server to handle streams stored in a database with a different block size.
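For reference, the per-block-size buffer pools come from the globals parameter in the [config] section of the CPF. A hedged illustration (the six comma-separated values are MB of buffers for 2KB through 64KB blocks, DBSizesAllowed enables non-default block sizes, and exact parameter details can vary by version):

```
[config]
globals=0,0,4096,1024,0,0
DBSizesAllowed=8192,16384
```

This would give 4GB of 8KB buffers plus a deliberately small 1GB pool of 16KB buffers for the stream databases.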

It depends. I would prefer to store the files in the Linux filesystem, with a directory structure based on a hash of the file, and only store the meta-information (like filename, size, hash, path, author, title, etc.) in the database; see the sketch below. In my humble opinion this has the following advantages over storing the files in the database:

  • The restore process for a single file will run shorter than the restore of a complete database with all files.
  • Using version control (e.g. svn or git) for the files is possible, with history.
  • Bitrot will only destroy single files. This should be no problem if a filesystem with integrated checksums (e.g. btrfs) is used.
  • Only a webserver, and no database, is needed to serve the files.
  • You can move the files behind a proxy or a load balancer to increase availability without having to use an HA setup of Caché/IRIS.
  • Better usage of the filesystem cache.
  • Better support for rsync.
  • Better support for incremental/differential backups.

But the pros and cons may vary depending on the size and number of files and on your server setup. I suggest building two PoCs, loading a reasonable amount of files into each one, and doing some benchmarks to get figures about the performance, as well as testing some DR and restore scenarios.
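A sketch of the hash-based layout idea (the directory root is a placeholder, and the exact hashing call may differ by version):

```
/// Returns a path like /data/files/ab/cd/abcd...ef for a file, using the
/// first hex characters of its SHA-256 hash as the directory levels.
ClassMethod HashPath(sourceFile As %String, root As %String = "/data/files") As %String
{
    set in = ##class(%Stream.FileBinary).%New()
    do in.LinkToFile(sourceFile)
    set raw = $system.Encryption.SHAHashStream(256, in, .sc)  // raw binary hash
    set hex = $zconvert(##class(%xsd.hexBinary).LogicalToXSD(raw), "L")
    quit root _ "/" _ $extract(hex, 1, 2) _ "/" _ $extract(hex, 3, 4) _ "/" _ hex
}
```

The hash would also be the natural key for the metadata row in the database, which gives deduplication of identical uploads for free.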

One more consideration for whether to store the files inside the database or not is how much space gets wasted due to the block size. Files stored in the filesystem get their size rounded up to the block size of the device. For Linux this tends to be around 512 bytes (blockdev --getbsz /dev/...). Files stored in the database as streams are probably* stored using "big string blocks". Depending on how large the streams are, the total space consumed (used+unused) may be higher when stored in a database. ^REPAIR will show you the organization of a data block. 

*This assumes that the streams are large enough to be stored as big string blocks - if the streams are small and are stored in the data block, then there will probably be little wasted space per block as multiple streams can be packed into a single data block.

In my opinion, it is much better and faster to store binary files outside the database. I have an application with hundreds of thousands of images. To get faster access on a Windows O/S, they are stored in YYMM folders (to prevent having too many files in one folder, which might slow down access), while the file path and file name are of course stored inside the database for quick access (using indices). As those images are read many times, I did not want to "waste" the cache buffers on those reads, hence storing them outside the database was the perfect solution. A minimal sketch of that folder scheme is below.
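A minimal sketch of the YYMM layout (the root folder is a placeholder):

```
/// Builds a folder like C:\images\2105 from today's date (YYMM),
/// creates it if needed, and returns the full target path for a file.
ClassMethod ImagePath(fileName As %String, root As %String = "C:\images") As %String
{
    set yymm = $extract($zdate($horolog, 8), 3, 6)  // YYYYMMDD -> YYMM
    set dir = root _ "\" _ yymm
    if '##class(%File).DirectoryExists(dir) {
        do ##class(%File).CreateDirectoryChain(dir)
    }
    quit dir _ "\" _ fileName
}
```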