There is a large file that needs to be processed, with over 500,000 rows. Each row needs to be verified against various data checks, and then all the results collated into a single report. What is the most efficient way to do this? I tried calling the processing function with the Job command, where each jobbed-off function would report its results to a different node of a common global. But the jobbed functions are not updating their respective nodes, even though I am passing in the global name and root node. What is the most efficient way to process this large file? Any Caché/Ensemble ideas are appreciated.
Sounds like you might be adding unnecessary complexity.
> What is the most efficient way to process this large file?
That really depends on your definition of efficiency.
If you want to solve the problem with the fewest watts, then a single process is the most efficient.
If you add more processes, you will be executing additional code just to co-ordinate responsibilities. There is also the danger that competing processes will flush data blocks out of the buffer cache in a less efficient way.
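For reference, a single-process version of this job can be very plain. Here is a minimal ObjectScript sketch; the class name, the `VerifyRow` method and the report counters are placeholders, and only the `%Stream.FileCharacter` API is standard Caché:

```objectscript
/// Minimal single-process sketch: read the file line by line,
/// verify each row, and tally the results for the final report.
ClassMethod ProcessFile(filename As %String) As %Status
{
    Set stream = ##class(%Stream.FileCharacter).%New()
    Set sc = stream.LinkToFile(filename)
    If $$$ISERR(sc) Quit sc
    Set ok = 0, bad = 0
    While 'stream.AtEnd {
        Set line = stream.ReadLine()
        // VerifyRow is a placeholder for your own validation logic
        If ..VerifyRow(line) { Set ok = ok + 1 }
        Else { Set bad = bad + 1 }
    }
    Write "Rows verified: ", ok, "  Rows failed: ", bad, !
    Quit $$$OK
}
```

With 500,000 rows, a loop like this is usually disk-bound on the lookups inside the verification step, not on the file read itself, which is why the indexing advice below matters more than parallelism.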
If you want to solve the problem quickly, then it's important to understand where the bottlenecks are before trying to optimise anything (avoid premature optimisation).
If your process is taking a long time (hours, not minutes), then you most likely have data queries with a high relative cost. It's not uncommon for a large job like this to run 1,000x quicker just by adding the right index in the right place.
Normally I would write a large (single-process) job like this and then observe it in the Management Portal (System > Process > Process Details). If I see it's labouring over a specific global, then I can track back to where an index might be needed.
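As a concrete illustration (the class and property names here are hypothetical), adding an index in a Caché persistent class is a one-line change:

```objectscript
/// Hypothetical persistent class holding the reference data
/// that each row is verified against.
Class Demo.RowData Extends %Persistent
{
Property AccountNumber As %String;

Property Status As %String;

/// Index on the property the verification queries filter by.
Index AccountIdx On AccountNumber;
}
```

If the data is already loaded when you add the index, populate it with `%BuildIndices`, e.g. `Set sc = ##class(Demo.RowData).%BuildIndices($ListBuild("AccountIdx"))`.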
You will then get further efficiency/speed gains by making sure the tables are tuned (e.g. with TuneTable) and that Caché has as much memory configured for global buffers as you can afford.
If you are writing lots of data during this process, then also consider using a temporary global that won't hit the journal files. If the process is repeatable from the file, then there is no danger in losing these temp globals during a crash, as you can just restart the job after the restore.
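There are two common ways to get an unjournaled scratch global in Caché; the global names below are illustrative:

```objectscript
// Option 1: a process-private global. Never journaled, visible only
// to this process, and automatically deleted when the process ends.
Set ^||results(rowNum) = $ListBuild(status, errorText)

// Option 2: a global in the CACHETEMP database. By default any global
// whose name begins with CacheTemp is mapped there and is not
// journaled; it outlives the process but is wiped at system restart.
Set ^CacheTempReport($Job, rowNum) = status
```

Option 1 suits a single-process job; option 2 is useful if a separate reporting step needs to read the intermediate results afterwards.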
Lastly, I would avoid using Ensemble for this. The last thing you want to do is generate 500,000 Ensemble messages if there is no need to integrate the rows of data with anything other than internal data tables.
Correction: it's perfectly fine (for Ensemble) to ingest your file and process it as a single message stream. What I wouldn't do is split the file into 500,000 messages when there is no need to. Doing so would obviously cause additional IO.