Discussion
· May 17, 2021

Should we store external files in InterSystems %Stream or Windows folders?

We've been tasked with developing a file upload module as part of our wider system, storing scanned documents against a patient's profile. Our InterSystems manager suggested that storing those files in the DB as streams would be the best approach, and it sounded like a solid idea: the data can be encrypted, complex indexes are possible, the engine is optimized for large files, and so on. However, the stakeholder questioned why we would want to do that over storing them in Windows folders, and said that putting them in the DB was nuts. So we were wondering what everyone else has done in this situation and what made them take that route.

Where do you primarily store your users' uploaded files within your application?

The nice advantage of storing them in the DB is that it makes the following easier:

- refreshing earlier environments for testing
- mirroring the file contents
- encryption
- simpler consistent backups

However, if you're talking about hundreds of GBs of data, then you can run into issues which you should weigh against the above:

- journaling volume
- .dat size
- .dat restore time

One way to help mitigate the above for larger-volume file management is to map the classes that store the stream properties into their own .DAT so they can be managed separately from other application data; you can even use subscript-level mapping to cap the size of the file .DATs. For example:
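A minimal sketch of that idea, with hypothetical class and global names (the LOCATION parameter directs stream data into a global of your choosing, which you can then map to a dedicated database):

```
/// Hypothetical document class. The LOCATION parameter sends stream data
/// to ^App.DocStreamD instead of the default extent stream global, so that
/// global can be mapped to its own .DAT and managed separately.
Class App.Document Extends %Persistent
{
Property PatientId As %String [ Required ];
Property FileName As %String(MAXLEN = 255);
Property Contents As %Stream.GlobalBinary(LOCATION = "^App.DocStreamD");
Index PatientIdx On PatientId;
}
```

From there, a global mapping for App.DocStreamD (created in the Management Portal or via the Config.MapGlobals API in %SYS) points that global at the dedicated database, and subscript-level mappings on the same global can split it across several .DATs to cap their individual sizes.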

Hope that helps

I can't disagree with Ben: there is a cut-off point where it makes more sense to store the files external to IRIS. It should be noted that if I were working with any other database technology, such as Oracle or SQL Server, I wouldn't even consider storing 'blobs' in the database. However, Caché/Ensemble/IRIS is extremely efficient at storing stream data, especially binary streams.

I agree with Ben that by storing the files in the database you will have the benefits of journaling and backups, which support 24/7 uptime. If you are using mirroring as part of your disaster recovery strategy, then restoring your system will be faster.

If you store the files externally, you will need to back up the files as a separate process from your Caché/Ensemble/IRIS backups. I assume that you would have a separate file server, as you wouldn't want to keep the external files on the same server as your Caché/Ensemble/IRIS server, for several reasons:

1) You would not want the files to be stored on the same disk as your database .dat files, as the disk I/O might be compromised.

2) If your database server crashes you may lose the external files unless they are on a separate server.

3) You would have to back up your file server to another server or suitable media.

4) If the stream data is stored in IRIS then you can use iFind and iKnow on the file content, which leads you into the realms of ML, NLP and AI (see the sketch after this list).

5) If your CACHE.DAT files and the external files are stored on the same disk system, you can potentially run into disk fragmentation issues over time, and the system will get slower as the fragmentation gets worse. It is far better to have your CACHE.DAT files on a disk system of their own with the database growth factor set quite high: database growth will then be contiguous, fragmentation is considerably reduced, and the stream data will be managed as effectively as any other global structure in Caché/Ensemble/IRIS.
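To illustrate point 4, here is a rough sketch of an iFind index over stream content (the class and index names are hypothetical, iFind requires an appropriate license, and it indexes text, so a scanned image would first need OCR into a character stream):

```
Class App.ScannedDoc Extends %Persistent
{
/// Extracted text of the scanned document (not the raw image)
Property Contents As %Stream.GlobalCharacter;
/// Full-text index over the stream content
Index ContentFT On (Contents) As %iFind.Index.Basic;
}
```

That could then be queried with SQL along the lines of SELECT %ID FROM App.ScannedDoc WHERE %ID %FIND search_index(ContentFT, 'diabetes').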

Yours

Nigel

Fragmentation is not much of an issue anymore with SSD disks.

But in any case, I agree with storing files in the database. I have a system in production with about 100TB of data, where more than half is just files stored in the database. Some of our .dat files are, through mappings, used exclusively for streams, and we take care of them periodically by cutting them off at some point to continue with an empty database. Mirroring helps us not worry too much about backups. But if we had to store that amount of files as files on the filesystem, we would lose our minds caring about backups and integrity.
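A heavily hedged sketch of that "cut and continue" idea, assuming the stream global (App.DocStreamD, the hypothetical global from the earlier sketch) is already covered by its own global mapping. All names and paths are assumptions, the exact Config API calls can differ by version, and repointing the mapping makes the old streams invisible to the namespace, so they must only be archive data at that point. Run in %SYS:

```
// Create a fresh, empty database for new stream data (path is hypothetical)
set sc = ##class(SYS.Database).CreateDatabase("/data/streams-next/")
set props("Directory") = "/data/streams-next/"
set sc = ##class(Config.Databases).Create("STREAMSNEXT", .props)

// Repoint the existing stream-global mapping at the new database;
// the old .DAT is kept aside purely as an archive from here on.
set map("Database") = "STREAMSNEXT"
set sc = ##class(Config.MapGlobals).Modify("APP", "App.DocStreamD", .map)
```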

I'm throwing in another vote for streams for all the reasons in the above reply chain, plus two more:

1. More efficient hard drive usage. If you have a ton of tiny files and your hard drive is formatted with a larger allocation unit, you're going to use a lot of space very inefficiently and very quickly.

2. At my previous job, we got hit by ransomware years ago that encrypted every document on our network. (Fortunately, we had a small amount of data and a good offline backup process, so we were able to recover fairly quickly!) We were also using a document management solution that ran on Cache and stored the files as Stream objects, and they were left untouched. I'm obviously not going to say streams are ransomware-proof, but that extra layer of security can't hurt!

It's not a good idea to store files in the DB that you'll simply be reading back in full. The main issue you'll suffer from if you do hold them in the database (which nobody else seems to have picked up on)  is that you'll needlessly flush/replace global buffers every time you read them back (the bigger the files, the worse this will be).  Global buffers are one of the keys to performance.  

Save the files as files and use the database to store their filepaths as data and indices, for example:
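A minimal sketch of that pattern (the class and property names are hypothetical):

```
/// Only metadata and the external path live in the database;
/// the bytes stay on the filesystem.
Class App.DocumentRef Extends %Persistent
{
Property PatientId As %String [ Required ];
Property FilePath As %String(MAXLEN = 1024) [ Required ];
Property SizeBytes As %Integer;
Index PatientIdx On PatientId;
Index PathIdx On FilePath [ Unique ];
}
```

If you want the rest of the code to stay stream-agnostic, %Stream.FileBinary can wrap such an external file behind the same stream interface as %Stream.GlobalBinary.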

Hey Matthew,

No technical suggestions from me, but I would say that there are pros/cons to file vs. global streams which have been covered quite well by the other commenters. For the performance concern in particular, it is difficult to compare different environments and use patterns. It might be helpful to test both file and global streams and see how the performance of your expected stream usage, combined with your system activity, plays into your decision to go with one or the other; a rough harness could look like the sketch below.
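A rough timing sketch, assuming a utility class to put it in; the counts, sizes, and the /tmp path are placeholders to adjust to your real document profile:

```
/// Writes `files` streams of roughly `mb` MB each, first as global
/// streams and then as file streams, printing the elapsed seconds.
ClassMethod TimeStreams(files As %Integer = 50, mb As %Integer = 5)
{
    set chunk = $translate($justify("", 32000), " ", "x")  // ~32KB payload
    set chunks = (mb * 1048576) \ $length(chunk)

    set t = $zhorolog
    for i = 1:1:files {
        set s = ##class(%Stream.GlobalBinary).%New()
        for j = 1:1:chunks { do s.Write(chunk) }
        do s.%Save()
    }
    write "global streams: ", $zhorolog - t, "s", !

    set t = $zhorolog
    for i = 1:1:files {
        set s = ##class(%Stream.FileBinary).%New()
        set s.Filename = "/tmp/bench_" _ i _ ".bin"  // placeholder path
        for j = 1:1:chunks { do s.Write(chunk) }
        do s.%Save()
    }
    write "file streams: ", $zhorolog - t, "s", !
}
```

Reading the streams back under a realistic concurrent load matters just as much as the write times, so a read pass over the saved streams would be a sensible addition.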

I've mentioned above a system with a significant amount of streams stored in the database. I just checked how the global buffers are used there, and streams occupy only around 6%. The system is very active, including the files: tons of objects are created every minute, files get attached, and files change (yeah, our users can change MS Word files online on the fly, and we keep all the versions).

So, I still see no reason to change it, and I still see tons of benefits in keeping it as is.

I'm not sure what you mean by this. On an IRIS instance configured with global buffers of different sizes, the different-sized buffers are organized into separate pools. Each database is assigned to a pool based on the smallest size available that can handle that database. If a system is configured with 8KB and 32KB buffers, the 32KB buffers could be assigned to handle 16KB or 32KB databases but never 8KB databases.

Jeffrey, thanks. But suppose I had only a 16KB buffer pool configured, with a mix of 8KB databases (mostly system, or CACHETEMP/IRISTEMP) and some of my application data stored in 16KB blocks. The 8KB databases would still be buffered in the 16KB pool, stored one to one: 8KB of data in a 16KB buffer. Is that correct?

So, if I want to separate the global buffers used for streams, I just need a block size for them distinct from any other data, plus a deliberately small buffer pool for that block size, and that would be enough for more efficient usage of the global buffers, at least giving non-stream data a higher priority?

Yes, if you have only 16KB buffers configured and both 8KB and 16KB databases, then the 16KB buffers will be used to hold 8KB blocks - one 8KB block stored in one 16KB buffer using only 1/2 the space...

If you allocate both 8KB and 16KB buffers then (for better or worse) you get to control the buffer allocation between the 8KB and 16KB databases. 

I'm just suggesting that this is an alternative to standing up a 2nd server to handle streams stored in a database with a different block size.
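For reference, the per-block-size buffer pools come from the globals parameter in the [config] section of the CPF. A hedged illustration (the six comma-separated values are MB of buffers for 2KB through 64KB blocks, DBSizesAllowed enables non-default block sizes, and exact parameter details can vary by version):

```
[config]
globals=0,0,4096,1024,0,0
DBSizesAllowed=8192,16384
```

This would give 4GB of 8KB buffers plus a deliberately small 1GB pool of 16KB buffers for the stream databases.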

It depends. I would prefer to store the files in the Linux filesystem, with a directory structure based on a hash of the file, and only store the meta-information (like filename, size, hash, path, author, title, etc.) in the database; see the sketch below. In my humble opinion this has the following advantages over storing the files in the database:

  • The restore process for a single file will run shorter than the restore of a complete database with all files.
  • Using version control (e.g. svn or git) for the files is possible, with history.
  • Bitrot will only destroy single files. This should be no problem if a filesystem with integrated checksums (e.g. btrfs) is used.
  • Only a webserver, and no database, is needed to serve the files.
  • You can move the files behind a proxy or a load balancer to increase availability without having to use an HA setup of Caché/IRIS.
  • Better usage of the filesystem cache.
  • Better support for rsync.
  • Better support for incremental/differential backups.

But the pros and cons may vary depending on the size and number of files and on your server setup. I suggest building two PoCs, loading a reasonable amount of files into each one, and doing some benchmarks to get figures about the performance, as well as testing some DR and restore scenarios.
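A sketch of the hash-based layout idea (the directory root is a placeholder, and the exact hashing call may differ by version):

```
/// Returns a path like /data/files/ab/cd/abcd...ef for a file, using the
/// first hex characters of its SHA-256 hash as the directory levels.
ClassMethod HashPath(sourceFile As %String, root As %String = "/data/files") As %String
{
    set in = ##class(%Stream.FileBinary).%New()
    do in.LinkToFile(sourceFile)
    set raw = $system.Encryption.SHAHashStream(256, in, .sc)  // raw binary hash
    set hex = $zconvert(##class(%xsd.hexBinary).LogicalToXSD(raw), "L")
    quit root _ "/" _ $extract(hex, 1, 2) _ "/" _ $extract(hex, 3, 4) _ "/" _ hex
}
```

The hash would also be the natural key for the metadata row in the database, which gives deduplication of identical uploads for free.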

One more consideration for whether to store the files inside the database or not is how much space gets wasted due to the block size. Files stored in the filesystem get their size rounded up to the block size of the device. For Linux this tends to be around 512 bytes (blockdev --getbsz /dev/...). Files stored in the database as streams are probably* stored using "big string blocks". Depending on how large the streams are, the total space consumed (used+unused) may be higher when stored in a database. ^REPAIR will show you the organization of a data block. 

*This assumes that the streams are large enough to be stored as big string blocks - if the streams are small and are stored in the data block, then there will probably be little wasted space per block as multiple streams can be packed into a single data block.

In my opinion, it is much better and faster to store binary files outside the database. I have an application with hundreds of thousands of images. To get faster access on a Windows O/S, they are stored in YYMM folders (to prevent having too many files in one folder, which might slow down access), while the file path and file name are of course stored inside the database for quick access (using indices). As those images are read many times, I did not want to "waste" the cache buffers on those reads, hence storing them outside the database was the perfect solution. A minimal sketch of that folder scheme is below.
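A minimal sketch of the YYMM layout (the root folder is a placeholder):

```
/// Builds a folder like C:\images\2105 from today's date (YYMM),
/// creates it if needed, and returns the full target path for a file.
ClassMethod ImagePath(fileName As %String, root As %String = "C:\images") As %String
{
    set yymm = $extract($zdate($horolog, 8), 3, 6)  // YYYYMMDD -> YYMM
    set dir = root _ "\" _ yymm
    if '##class(%File).DirectoryExists(dir) {
        do ##class(%File).CreateDirectoryChain(dir)
    }
    quit dir _ "\" _ fileName
}
```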