Using multiple processes to process a single file
It is taking several hours to read a large text file because the while loop uses ReadLine(). Is there some way in Caché to process a single file using multiple processes? Something comparable to this:
http://stackoverflow.com/questions/11196367/processing-single-file-from-...
Hi,
It depends a little on what exactly the bottleneck is. If it's the processing, you could split the reading from the processing by inserting a scheduling mechanism: have a single read process queue things up and several worker threads do the processing.
If the bottleneck is the reading, there is not much you can do. Reading from the same file in multiple locations is going to make it slower overall.
Best,
Fab
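The split described above (one reader queuing up work, several workers doing the processing) can be sketched as follows. This is only an illustration of the pattern in Python, not Caché or Ensemble code; the names process_file_with_workers and handle_record are hypothetical, and it assumes the per-record work is independent, so records may be processed out of order.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker there is no more work


def process_file_with_workers(line_source, handle_record, worker_count=4, max_queued=1000):
    """Read records sequentially in one thread and process them in several workers.

    line_source:   any iterable of lines, e.g. an open file object.
    handle_record: hypothetical callback that does the per-record work.
    Returns the number of records read.
    """
    # Bounded queue, so the reader cannot run far ahead of the workers.
    work = queue.Queue(maxsize=max_queued)
    errors = []

    def worker():
        while True:
            item = work.get()
            if item is _STOP:
                break
            try:
                handle_record(item)
            except Exception as exc:  # collect failures instead of killing the worker
                errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()

    count = 0
    for line in line_source:  # the single, sequential reader
        work.put(line.rstrip("\n"))
        count += 1

    for _ in threads:  # one stop marker per worker
        work.put(_STOP)
    for t in threads:
        t.join()

    if errors:
        raise errors[0]
    return count
```

The file is still read sequentially by one reader, which matches the point that reading the same file from multiple locations does not help. Only the processing is spread across workers.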
Can you show us the code of your while loop, in case DC members can suggest performance improvements?
This is the while loop in the code of a FileService class that extends EnsLib.RecordMap.Service.FileService.
    While 'tInputStream.AtEnd {
        Set tSC = ..GetRecordObject(tInputStream, .tRecordObject,,.tLookAhead)
        If $$$ISERR(tSC) Quit
        Set tSC = ..SendRequest(tRecordObject, 1)
        If $$$ISERR(tSC) Quit
    }
The requests are sent asynchronously to a target, but the while loop still takes a significant amount of time to complete.
Are you creating 500,000 Ensemble messages from one file?
Please see previous comment to same question...
https://community.intersystems.com/post/multi-threading-improve-performance
You previously mentioned your file contains 500,000 records. How big are the records?
If each record is just a few k in size then reading and writing each record from the file to a global will take under 5 minutes.
If it is taking much longer than that, you have a serious bottleneck somewhere, such as a poorly performing referential integrity check. If this is not indexed or tuned, you will have serious IO thrashing going on, and no matter how many cores you throw at the problem, you will not get any overall performance gain.
The previous comment recommended processing the file as a single message stream. That ended up slowing the message viewer so much for these large messages that it is impossible to view the message at all, because the stream is too large. So this line-by-line approach is being explored.
What is your definition of a few k? Each line is about 25000 KB.
> What is your definition of a few k? Each line is about 25000 KB.
Do you mean 25,000 characters (25K)?
> Previous comment recommended processing the file as a single message stream. That ended up slowing the message viewer so much for these large messages that it is impossible to view the message at all.
You can override the content display method on your Ens.Request class so that it doesn't display the entire message. You can replace it with a small summary of the file: its size, the number of records, etc.
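As a rough idea of what such a summary could compute, here is a minimal sketch in Python rather than Ensemble code. The function name summarize_records is hypothetical, and it assumes one record per line. In Ensemble the equivalent logic would go in the overridden display method of the message class.

```python
import io


def summarize_records(stream, name="(unknown)"):
    """Return a one-line summary of a stream instead of its full content."""
    record_count = 0
    char_count = 0
    for line in stream:  # iterate lazily so the whole file is never held in memory
        record_count += 1
        char_count += len(line)
    return f"{name}: {record_count} records, {char_count} characters"


# Example with a tiny in-memory stream standing in for the real file
print(summarize_records(io.StringIO("r1\nr2\n"), "sample.txt"))
# prints "sample.txt: 2 records, 6 characters"
```

The viewer then only has to render one short line, however large the underlying stream is.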
Creating 500,000 Ensemble messages is going to generate a lot of IO that you probably don't need.
I would still recommend processing them as one file.