Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace LIMIT/OFFSET with Psycopg2 server-side cursor #140

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

anthonyfok
Copy link
Member

Fixes #138

(Draft, to be tested in full stack run.)

@anthonyfok anthonyfok added this to the Sprint 45 milestone Oct 22, 2021
@anthonyfok anthonyfok self-assigned this Oct 22, 2021
@anthonyfok anthonyfok force-pushed the use-psycopg2-server-side-cursor branch from d3f2c80 to e54dfba Compare October 22, 2021 18:20
@anthonyfok anthonyfok force-pushed the use-psycopg2-server-side-cursor branch from e54dfba to fa2778e Compare October 22, 2021 18:25
@drotheram
Copy link
Contributor

Nice work. Would be interested to see some performance comparisons. LIMIT/OFFSET is also used extensively in many of the *_postgres2es.py scripts.
Would like to make sure that changing the pagination method is at least:

  • more efficient than LIMIT/OFFSET
  • as robust in making sure that all records and no duplicates are copied from PostGIS to ES

@anthonyfok
Copy link
Member Author

Thank you @drotheram for the great ideas! You have hit the nail on the head, and indeed, my first stack run has already failed — related to #137, used a wrong search-and-replace regex and changing python3 to pypy3 in too many places, and Debian's pypy3 and python3-numpy not working together. Easy to fix, but does show how fragile these changes can be, especially when hurriedly committed in a somewhat sleepy state. 😅

But yeah, there are three main changes in this series of pull requests:

  1. New base OS image for python-env, and slightly updated Python libraries
  2. The switch from CPython to PyPy for some scripts
  3. LIMIT/OFFSET vs server-side cursor

and each of them can introduce unintended changes to the data. I have been wondering e.g. how to verify that the actual data exported to Elasticsearch are identical: Maybe dumping the GeoJSON data for each 10000 rows as individual files to compare? Maybe using a separate script to query the Elasticsearch server?

Ideally, probably need to run 5 test cases:

  1. v1.2.0 as control
  2. with the new debian:sid-20201012-slim-based python-env OS image, but nothing else changed. (CPython + LIMIT/OFFSET)
  3. No. 2 + PyPy + LIMIT/OFFSET
  4. No. 2 + CPython + server-side cursor
  5. No. 2 + PyPy + server-side cursor

Ditto for the benchmark, and in particular, I am interested to find out whether the LIMIT/OFFSET method does gets slower and slower when OFFSET gets really big. Yes, let's record the timestamps and plot them! And maybe have these tests and benchmarks as part of the Python scripts so we can monitor data integrity and performance over time.

Sounds a bit scary but at the same time exciting to me, but yeah, this would be something that let us monitor the health and performance of our stack. Probably as something that CI/CD with GitHub Actions could do.

And before I forget, I am also starting to think that in the future we could have the Docker Compose logs, test and benchmark results semi-automatically uploaded to an S3 bucket for record and analysis. (Privacy alert!) With the end user's consent of course. And also collect e.g. CPU, RAM, hard disk information too... kind of like CPU-ID or Passmark... Or maybe not that good of an idea. 😅

@anthonyfok anthonyfok modified the milestones: Sprint 45, Sprint 46 Nov 8, 2021
@anthonyfok anthonyfok modified the milestones: Sprint 46, Sprint 47 Nov 22, 2021
@anthonyfok anthonyfok removed this from the Sprint 47 milestone Jan 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Explore alternative methods of SQL pagination for speed
2 participants