A fetch_all() method for the Python client class

Question

Question

Mads Lien · Jul 15, 2016

It would be convenient to be able to fetch the whole query result in one operation instead of looping through every row. A fetch_all() method or something similar would provide this functionality I guess. Is there a reason why this is not implemented?

Discussion (7)1

Log in or sign up to continue

Alexander Koblov · Jul 15, 2016

And fetch_all() can be implemented using fetch().

0 0

score 2 · Answer 1 · 2016-07-15T11:16:41-04:00

That's generally a questionable idea, here's why:

It's the slowest possible method - if we're iterating over result set the application logic may decide to end processing not after the last row but earlier. If a whole result set is returned, then we still spent CPU/RAM to calculate something we may not even need
Result set may be quite large which would cause application logic errors
Result set may be larger than amount of RAM available
Time to first result is usually smaller than getting all results, so the user can start working with the first row, while an application fetches more rows

score 1 · Answer 2 · 2016-10-13T11:05:28-04:00

What's speed are you getting? Rows per second and average size of row for example.

Have you tried comparing that speed with direct iteration over a same resultset in COS?

score 0 · Answer 3 · 2016-10-18T07:35:13-04:00

I think the issue is solved. I rewrote the python code and now the process is quick. It seems that the pandas method of adding rows to a dataframe is not suited for use in loops. So I added the rows from the query to a nested list and created a dataframe from the list ouside the loop instead.

Thanks for your input on this issue.

score 0 · Answer 4 · 2016-10-17T10:57:05-04:00

Write the same dataset loop (with embedded sql) in Caché ObjectScript and see how much time does it take to fetch the same rows.

score 0 · Answer 5 · 2016-10-17T09:09:07-04:00

I have measured speeds now and for a dataset of 35.000 rows it starts out at 4-5 ms per row added to the dataframe and increases to about 9 ms at the end of the loop. For bigger data sets with millions of rows this will take a very long time. The response from the database takes less than a second and the rows contain 5 fields. I am using this method in the loop:

row = query.fetch([None])
df.loc[len(df)] = row

What do you mean by COS and how can I do this comparison?

score 0 · Answer 6 · 2016-10-13T08:45:00-04:00

Thanks for the answer Eduard.

I might be looking in the wrong place for a solution to my problem. I am working with larger data sets, up to 15-20 million rows, and using the python pandas library for data manipulation and analysis. To populate a dataframe I fetch rows from the query object one row at the time and this is very slow. After the dataframe is populated there are no issues with speed so I suspect that this process is the culprit. Is there a more efficient way of doing this?