How do I get a list of files in a directory, including subdirectories?

I'm aware of two ways to get a list of files in a directory:

set dir = "C:\temp\"
set rs = ##class(%File).FileSetFunc(dir, , , 1)
do rs.%Display()

and:

set dir = "C:\temp\"
set file=$ZSEARCH(dir_"*")
while file'="" {
   write !,file
   set file=$ZSEARCH("")
}

Yet they both return only the files and directories in the current directory, not files in subdirectories.

I suppose I could call one of these recursively, but maybe there's a better solution?
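For reference, the recursive approach hinted at here can be sketched roughly as follows (an untested sketch; the class and method names are made up for illustration). One caveat worth noting: $ZSEARCH keeps a single search context per process, so a directory's entries have to be collected before recursing into its subdirectories.

```objectscript
/// Hypothetical sketch of a recursive $ZSEARCH walk.
/// Collected file names end up as subscripts of the "files" array.
ClassMethod WalkDir(dir As %String, ByRef files)
{
    ; gather this directory's entries first: a nested $ZSEARCH
    ; would reset the single per-process search context
    set file = $zsearch(dir_"*")
    while file '= "" {
        set children(file) = ""
        set file = $zsearch("")
    }
    ; now it is safe to recurse into subdirectories
    set file = $order(children(""))
    while file '= "" {
        if ##class(%File).DirectoryExists(file) {
            do ..WalkDir(##class(%File).NormalizeDirectory(file), .files)
        } else {
            set files(file) = ""
        }
        set file = $order(children(file))
    }
}
```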

Answers

Unless you're asking for maximum performance, you could create a method that receives or creates a %SQL.Statement for the %File:FileSet query, detects whether "Type" is "D", and calls itself recursively, passing that statement instance along.

Here's a use case where I applied that pattern:

Method SearchExtraneousEntries(
statement As %SQL.Statement = "",
path As %String,
ByRef files As %List = "")
{
  
  if statement = "" {
    set statement = ##class(%SQL.Statement).%New()
    $$$QuitOnError(statement.%PrepareClassQuery("%File", "FileSet"))
  }
  
  set dir = ##class(%File).NormalizeDirectory(path)
  set row = statement.%Execute(dir)
  set sc = $$$OK   
  
  while row.%Next(.sc) {
    if $$$ISERR(sc) quit
    set type = row.%Get("Type")
    set fullPath = row.%Get("Name")
    
    if ..IsIgnored(fullPath) continue
            
    if type = "D" {
      set sc = ..SearchExtraneousEntries(statement, fullPath, .files)
      if $$$ISERR(sc) quit
    }

    if '..PathDependencies.IsDefined(fullPath) {
      set length = $case(files, "": 1, : $listlength(files)+1)
      set $list(files, length) = $listbuild(fullPath, type)
    }
  }
  quit sc
}

I'm voting for Rubens's solution as it is OS-independent. Caché is a great sandbox in many cases, so why not use its "middleware" capabilities? Developer time costs much more than CPU cycles, and every piece of OS-dependent code has to be written and debugged separately for each OS that must be supported.

As to performance, in this very case I doubt the recursion costs much compared to the system calls. Anyway, it's not a big problem to replace it with iteration.

Initially the question was about alternative ways of solving the issue (in addition to recursive FileSet and $ZSEARCH).

I just proposed a third method, namely using the capabilities of the OS itself. Maybe someone here didn't know about it.

Which option to choose in the end is up to the developer.


Are we voting here for the best solution, or for offered solutions in general?

If the first, then I'll pass.

I don't understand the difference between these two kinds of voting :) Which solution is best depends on many factors: if we need top performance, we'd take your approach; if not, the %ResultSet-based one. BTW, I guess that scanning file directories is a small part of a bigger task, and those files are processed after they have been found, with the processing taking much longer than the directory search.

A last 2c for the cross-platform approach: the main place where a COS developer faces problems is interfacing with 3rd-party software. As one German colleague told me, "Caché is great for seamless integration".

E.g., I recently found that Caché for Linux forcibly resetting LD_LIBRARY_PATH may cause problems for some utilities on some Linux versions. Better to stop here; maybe I'll write about it separately.

 

What are ..IsIgnored and ..PathDependencies? Could you post a complete example?

Here.

If you need an explanation: the method I used as an example is part of an algorithm that keeps the repository in sync with the project.

You're right, your approach is the best, even though it's not cross-platform. That issue could be solved by using $$$isUNIX and $$$isVMS though.

Something like:

if $$$isUNIX set command = "find %1"
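Expanding that idea, the platform check could select the whole command template per OS (a hedged sketch; the find flags are just an example, and %1 would be substituted later, e.g. via $$$FormatText):

```objectscript
; pick an OS-appropriate command template; %1 is filled in later,
; e.g. with $$$FormatText(command, $$$quote(dir))
if $$$isWINDOWS {
    set command = "dir /A-D /B /S %1"
} elseif $$$isUNIX {
    set command = "find %1 -type f"
}
```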

<...> but maybe there's a better solution?

Yes, of course.

E.g. (for Windows x64):

#include %systemInclude

; see "dir /?"
##class(%Net.Remote.Utility).RunCommandViaCPIPE($$$FormatText("dir /A-D /B /S %1",$$$quote("C:\Program Files (x86)\Common Files\InterSystems")),,.con)
; string -> array
##class(%ListOfDataTypes).BuildValueArray($lfs($e(con,1,*-2),$$$NL),.array)
zw array

Nice solution; I've not seen RunCommandViaCPIPE used before.

Looking at the documentation it says...

"Run a command using a CPIPE device. The first unused CPIPE device is allocated and returned in pDevice. Upon exit the device is open; it is up to the caller to close that device when done with it."

Does this example need to handle this?

Also, do you not worry that it's an internal class?

> Does this example need to handle this?

Correct; of course one should close the device:

#include %systemInclude

#dim cDev As %String = $IO

; see "dir /?" and RunCommandViaZF()
##class(%Net.Remote.Utility).RunCommandViaCPIPE(...,.dev,.con)
/*
...
*/
close:($get(dev)'="") dev:"I"
use cDev
> Also, do you not worry that it's an internal class?

No, since for me internal ≠ deprecated.

However, given the openness of the source code, you can make your own similar method, thereby protecting yourself from possible issues in the future.

While it is true that Internal does not mean deprecated, it is still not recommended that you use such items in your application code. Internal means that this is for InterSystems' internal use only. Anything with this flag can change or be removed with no warning.

> Anything with this flag can change or be removed with no warning.

There are a few comments:

  • Let's say the developers changed something in a new version of the DBMS. Is this a problem?

    It is enough to check the Caché Recent Upgrade Checklists, which usually contain a ready-made list of changes that may affect existing user code, for example.

    Note that this can apply to absolutely any class member, even those not marked [Internal]. Suffice it to recall the recent story with JSON support.

  • The [Internal] flag does not exist for classes themselves; there the warning lives only in comments.

    As for class members, according to the documentation this flag serves a different purpose, namely:

    Internal class members are not displayed in the class documentation. This keyword is useful if you want users to see a class but not see all its members. proof
  • In any case, the final choice is up to the developer.

If you want something more regular, you could pipe the output via the OS. This also handles output of more than 32,000 characters:

 set errorLogDir = ##class(%File).TempFilename()
 set outputLogDir = ##class(%File).TempFilename()

 set command = "dir /A-D /B /S ""%1"" 2> ""%2"" > ""%3"""
 quit $zf(-1, $$$FormatText(command, "C:\InterSystems\Cache", errorLogDir, outputLogDir))

Now simply open and read the logs.
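Reading the logs back could look roughly like this (a sketch reusing the errorLogDir/outputLogDir names from above; LinkToFile attaches the stream to an existing file):

```objectscript
; read the captured stdout line by line
set output = ##class(%Stream.FileCharacter).%New()
do output.LinkToFile(outputLogDir)
while 'output.AtEnd {
    write output.ReadLine(), !
}
; clean up the temp files when done
do ##class(%File).Delete(outputLogDir)
do ##class(%File).Delete(errorLogDir)
```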

 

Yup, but if you read the source code, it's limited to 32,000 characters.

And if so?

do ##class(%Net.Remote.Utility).RunCommandViaZF($$$FormatText("dir /A-D /B /S %1",$$$quote("C:\InterSystems\Atelier")),.tFileName,,,$$$NO)
set f=##class(%Stream.FileCharacter).%New()
set f.Filename=tFileName
while 'f.AtEnd { write f.ReadLine(),! }
set f=""
do ##class(%File).Delete(tFileName)

 

ClassMethod GetFileTree(pFolder As %String, pWildcards As %String = "*", Output oFiles) As %Status
{
    set fileset=##class(%ResultSet).%New("%Library.File:FileSet")
    set sc=fileset.Execute(##class(%File).NormalizeDirectory(pFolder),pWildcards,,1)
    while $$$ISOK(sc),fileset.Next(.sc) {
        if fileset.Get("Type")="D" {
            set sc=..GetFileTree(fileset.Get("Name"),pWildcards,.oFiles)
        } else {
            set oFiles(fileset.Get("Name"))=""
        }    
    }
    quit sc
}

 

Search All...

set sc=##class(Some.Lib.File).GetFileTree("c:\Temp",,.files)


Search for specific file type...

set sc=##class(Some.Lib.File).GetFileTree("c:\Temp","*.html",.files)


Search for multiple files types

set sc=##class(Some.Lib.File).GetFileTree("c:\Temp","*.html;*.css",.files)

I don't recommend opening %ResultSet instances recursively.
It's more performant to open a single %SQL.Statement and reuse it.

This will also save quite a few bytes being allocated by the process as the recursion depth grows.

> I don't recommend opening %ResultSet instances recursively.

Agreed, but maybe splitting hairs if only used once per process

> It's more performatic if you open a single %SQL.Statement  and reuse that.

Actually, it's MUCH slower, not sure why. I just gave it a quick test, see for yourself...

ClassMethod GetFileTree(pFolder As %String, pWildcards As %String = "*", Output oFiles, ByRef pState = "") As %Status
{
    set sc=$$$OK
    if pState="" {
        set pState=##class(%SQL.Statement).%New()
        set sc=pState.%PrepareClassQuery("%File", "FileSet")
    }
    set fileset=pState.%Execute(##class(%File).NormalizeDirectory(pFolder),pWildcards,,1)
    while $$$ISOK(sc),fileset.%Next(.sc) {
        if fileset.%Get("Type")="D" {
            set sc=..GetFileTree(fileset.%Get("Name"),pWildcards,.oFiles,.pState)
        } else {
            set oFiles(fileset.%Get("Name"))=""
        }    
    }
    quit sc
}

 

** EDITED **

This example recycles the FileSet (see comments below regarding performance)

ClassMethod GetFileTree3(pFolder As %String, pWildcards As %String = "*", Output oFiles, ByRef fileset = "") As %Status
{
    if fileset="" set fileset=##class(%ResultSet).%New("%Library.File:FileSet")
    set sc=fileset.Execute(##class(%File).NormalizeDirectory(pFolder),pWildcards,,1)
    while $$$ISOK(sc),fileset.Next(.sc) {
        if fileset.Get("Type")="D" {
            set dirs(fileset.Get("Name"))=""
        } else {
            set oFiles(fileset.Get("Name"))=""
        }    
    }
    set dir=$order(dirs(""))
    while dir'="" {
        set sc=..GetFileTree3(dir,pWildcards,.oFiles,.fileset)        
        set dir=$order(dirs(dir))
    }
    quit sc
}

What the ... ? wow.
I'll have to update my code if that's true for every case. Could you measure the execution time for both approaches?

I've removed the recycled resultset example; it was not working correctly. It might not work at all as a recycled approach; I'll look at it further and run more timing tests if it works.

In the meantime, my original example without recycling the resultset, on a nest of folders with 10,000+ files, takes around 2 seconds, whereas the recycled SQL.Statement example takes around 14 seconds.

OK, I got the third example working; it needed to stash the dirs as they were getting lost.

Here are the timings...

Recursive ResultSet  =  2.678719

Recycled ResultSet  =  2.6759

Recursive SQL.Statement  =  15.090297

Recycled SQL.Statement  =  15.073955

I've tried it with shallow and deep folders with different file counts and the differential is about the same for all three.

The recycled objects surprisingly only shave off a small amount of time. I think this is because bottlenecks elsewhere overshadow the milliseconds saved.

SQL.Statement being 6-7x slower than ResultSet is a surprise, but then the underlying implementation is not doing a database query, which is where you would expect it to be the other way around.

The interesting thing now would be to benchmark one of the command line examples that have been given to compare.

Just for good measure, I benchmarked Vitaliy's last example, and it completes the same test in 0.344022 seconds, so for out-and-out performance a solution built around this approach is going to be the quickest.

If Vitaliy's is faster, that's probably because Caché is delegating control to the OS's native API.

That is indeed the best approach, and it could be made cross-platform by using $$$isUNIX, $$$isWINDOWS and $$$isVMS.
Now I gotta say, I'm impressed by these results. 


Ex: Using find -iname %1 instead of dir.

Then what's the advantage of using %SQL.Statement over %ResultSet? The only reason I can think of is using its metadata.

EDIT: Ah, there's a detail. FileSet is not a persisted query, nor does it use SQL. It's a custom query populated internally by $ZSEARCH.

Maybe for SQL based queries %SQL.Statement would be better.

How about this? Call with

    DO ^GETTREE("/home/user/dir/*",.result)

and $ORDER() through result.

#INCLUDE %sySite
GETTREE(wild,result)    ;
    NEW (wild,result)
    SET s=$SELECT($$$ISUNIX:"/",$$$ISWINDOWS:"\",1:1/0)    ; separator
    SET w=$SELECT($$$ISUNIX:"*",$$$ISWINDOWS:"*.*")        ; wild-card
    SET todo(wild)=""
    FOR {
      SET q=$ORDER(todo("")) QUIT:q=""  KILL todo(q)
      SET f=$ZSEARCH(q) WHILE f'="" {
        SET t=$PIECE(f,s,$LENGTH(f,s))
        IF (t'=".")&&(t'="..") SET result(f)="",todo(f_s_w)=""
        SET f=$ZSEARCH("")
      }
    }
    QUIT

Flaws:
On my Mac, I have some directories so deep that $ZSEARCH() fails.
It doesn't work on OpenVMS. As much as you may want to think you can convert dev:[dir]subdir.DIR;1 to dev:[dir.subdir]*.*;* and keep searching, there are too many weird cases to deal with on OpenVMS; better to just write a $ZF() interface to LIB$FIND_FILE() and LIB$FIND_FILE_END().