Michael Gosselin · May 22, 2018

Faster finds in large Unix directories

We have this challenge at our site. When we first designed it many years ago, we decided that the best way to store files was under a unique identifier that matched one of the fields in the corresponding record. For example, if the unique identifier was a nine-digit field (such as an SSN), we'd save a file as nnnnnnnnn.ext, where nnnnnnnnn is the nine-digit number and ext is the file extension. If we needed to change the file, we'd save the new version as nnnnnnnnn_hdate.ext, where hdate is the horolog date. And for 18 years, this was just fine.

However, in year 19, we discovered one of the growing pains of success. We had always searched files in the directory with a $ZF call from Cache. To find all the relevant files, we'd write something like S X=$ZF(-1,"ls -1 /filepath/nnnnnnnnn* >output.txt") to list the matching files in that directory into a single file that we'd then read and process. At some point, though, the searches started taking longer: CSPs began timing out after 60 seconds, and the customer service people complained that our searches were too slow.

To get around this, short-term, we moved every file from /filepath to /filepath_old, deleted the directory /filepath, then renamed /filepath_old to /filepath. This worked for about three months, then the searches started getting slower... again.

One idea we did have was to create sub-directories within the directory to sort the files. This would work for some cases, but it gets problematic when the search is looking for a date range. Not only that, we'd need an effective way to turn the unique identifier into the proper directory name to pass to the $ZF call, or to any %File calls.
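
One way to make the directory name derivable from the identifier alone is to shard on a fixed slice of the nine digits. The sketch below is in Python rather than ObjectScript, purely for illustration; the function name `shard_path`, the three-digit width, and the choice of the trailing digits are all assumptions, not part of the setup described above.

```python
from pathlib import PurePosixPath


def shard_path(root: str, filename: str, width: int = 3) -> PurePosixPath:
    """Return root/<last `width` digits of the identifier>/filename.

    Handles both nnnnnnnnn.ext and nnnnnnnnn_hdate.ext (hypothetical helper).
    """
    # The identifier is everything before the first "." or "_".
    ident = filename.split(".", 1)[0].split("_", 1)[0]
    if not (ident.isascii() and ident.isdigit() and len(ident) == 9):
        raise ValueError(f"unexpected identifier in {filename!r}")
    # Assumption: trailing digits spread files more evenly than leading ones.
    return PurePosixPath(root) / ident[-width:] / filename


print(shard_path("/filepath", "123456789_64790.pdf"))
```

Because every version of a given identifier lands in the same sub-directory, a date-range search for that identifier only has to list one small directory instead of the whole tree.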

So, I am asking for help and guidance on faster lookups, and better directory structures, for Cache with our Unix system. If you've had a similar problem, or can shed some light onto this, I would be most appreciative.

Thank you.


I remember a similar situation some years back with a rather sophisticated multilevel directory structure on UNIX.

The final solution, especially for all kinds of searches on unique file names, was a class with the filename as ID and the directory, summary, creation date, and last modification date as properties.
Searching it in SQL outperformed anything we had used before.

The only extra work was registering each file at creation or modification, which happened at a moderate rate, plus a nightly batch job to verify the register against what was actually on disk.
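
A rough sketch of that register-plus-nightly-check idea, written in Python with SQLite standing in for a Cache class and its SQL projection. The table name `file_index`, the column names, and the functions `open_index`, `register`, `find`, and `reconcile` are all hypothetical; the filename parsing assumes the nnnnnnnnn.ext and nnnnnnnnn_hdate.ext patterns from the question.

```python
import os
import sqlite3


def open_index(db_path=":memory:"):
    """Create (if needed) and open the file register."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS file_index (
               filename  TEXT PRIMARY KEY,  -- unique file name acts as the ID
               directory TEXT NOT NULL,
               ident     TEXT NOT NULL,     -- nine-digit identifier
               hdate     INTEGER,           -- horolog day number, NULL for the base file
               mtime     REAL NOT NULL      -- last modification time
           )"""
    )
    con.execute(
        "CREATE INDEX IF NOT EXISTS idx_ident_hdate ON file_index (ident, hdate)"
    )
    return con


def register(con, directory, filename):
    """Record one file; call this whenever a file is created or modified."""
    stem = filename.rsplit(".", 1)[0]
    ident, _, hdate = stem.partition("_")
    mtime = os.stat(os.path.join(directory, filename)).st_mtime
    con.execute(
        "INSERT OR REPLACE INTO file_index VALUES (?, ?, ?, ?, ?)",
        (filename, directory, ident, int(hdate) if hdate.isdigit() else None, mtime),
    )
    con.commit()


def find(con, ident, hdate_from=None, hdate_to=None):
    """Return file names for an identifier, optionally limited to a date range."""
    sql = "SELECT filename FROM file_index WHERE ident = ?"
    args = [ident]
    if hdate_from is not None:
        sql += " AND hdate >= ?"
        args.append(hdate_from)
    if hdate_to is not None:
        sql += " AND hdate <= ?"
        args.append(hdate_to)
    sql += " ORDER BY filename"
    return [row[0] for row in con.execute(sql, args)]


def reconcile(con, directory):
    """Nightly check: index files missing from the register, drop rows whose file is gone."""
    with os.scandir(directory) as entries:
        on_disk = {e.name for e in entries if e.is_file()}
    indexed = {
        row[0]
        for row in con.execute(
            "SELECT filename FROM file_index WHERE directory = ?", (directory,)
        )
    }
    added, removed = on_disk - indexed, indexed - on_disk
    for name in added:
        register(con, directory, name)
    con.executemany(
        "DELETE FROM file_index WHERE filename = ?", [(n,) for n in removed]
    )
    con.commit()
    return len(added), len(removed)
```

With this in place, a lookup such as `find(con, "123456789", 64795, 64805)` is answered from the index without listing the directory at all. Note that this `reconcile` only catches added and deleted files; detecting files modified behind the register's back would need an extra comparison on `mtime`.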

There are two use cases here:

  • File is tied to a specific object (for example, you have a "Document" class and it has a "scan" file). In that case you can use a %FileBinaryStream property, since you would probably open the "Document" object before getting the file anyway.
  • File is not tied to a specific object. In that case you can create a separate table "Files" that stores
    • link to file as a FileBinaryStream
    • hash
    • displayed file name
    • file path
    • user who uploaded the file
    • extension
    • size
    • any other attributes you need

A lookup against an indexed table like this will generally be much faster than an OS-level search of a large directory.

Other notes:

  • Files are immutable - if you're building an application where users can edit files, it's usually preferable to have immutable "file" objects and just create new file versions.
  • File size limits - always define and check for maximum size.
  • Extension limits - limit the extensions users can upload.
  • Storage - for low-volume inserts (<1000/day), store files in a folder named for the date; otherwise, start a new folder for every thousand files. These approaches can be combined: date/1, date/2 ...
  • Hash name - I often store files under their hash as the OS name. This way I can quickly check that a file is intact, and it also solves the problems with non-Latin characters.
  • Never store files under names supplied by the user. Acceptable filenames are: guid, hash (integer id should also be avoided).
  • GZIP - in some cases using GZIP streams can save on space, especially if it's a text file. For example XML envelopes and such.
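
The storage and hash-name notes can be combined roughly as follows. This is only a sketch in Python: SHA-256 is my choice of hash, and `store`, `is_valid`, `FILES_PER_FOLDER`, and the `seq` counter (how many files have already been stored that day) are hypothetical names, not an existing API.

```python
import hashlib
from datetime import date
from pathlib import Path

FILES_PER_FOLDER = 1000  # assumption: one folder per thousand files, as in the note above


def store(root: Path, data: bytes, seq: int, day: date) -> Path:
    """Write data to root/<YYYY-MM-DD>/<bucket>/<sha256 hex> and return the path.

    seq is the zero-based count of files already stored on that day.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Bucket folders are numbered 1, 2, ... giving date/1, date/2, ...
    folder = root / day.isoformat() / str(seq // FILES_PER_FOLDER + 1)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / digest
    path.write_bytes(data)
    return path


def is_valid(path: Path) -> bool:
    """A file is intact if its content still hashes to its own name."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.name
```

The displayed file name, extension, and uploader would then live in the "Files" table rather than in the file system, so the name on disk never contains user-supplied text.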