Skip to content

Conversation

@kevinmingtarja
Copy link
Collaborator

@kevinmingtarja kevinmingtarja commented Jul 15, 2025

When a new catalog file is fetched, there might be a difference for the catalogs among different executor processes, despite there being a single source of truth, on disk at ~/.sky/catalogs/*.

TL;DR this is because read_catalog returns a LazyDataFrame, which gets assigned to a global variable, for each unique (cloud, catalog) pair. And LazyDataFrame caches the DataFrame in memory for the duration of the process' lifetime, because it only calls update_func if self._df is None, which is only when it's first called.

This PR changes the behaviour of LazyDataFrame, such that now, update_if_stale_func returns a bool, that indicates if an update was done or not, and this is called on every read (_load_df), to ensure that a long running executor process is aware if it is time to refresh the catalog, and save the new one to disk.

To account for the performance penalty of calling the update if stale function for every read, we add two optimizations:

  1. In _update_catalog of read_catalog, we add a fast path by calling _need_update before trying to acquire the file lock, to prevent lock contentions. We still need to do it after acquiring the lock, to ensure only one process gets to write to disk, so that part is left as is
  2. Add @annotations.lru_cache(scope='request') to _load_df in LazyDataFrame. A single request could read a catalog multiple times, so this helps alleviate unnecessary calls to update_if_stale_func

I added a unit test in test_catalog.py

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@kevinmingtarja kevinmingtarja changed the title make sure catalogs get updated periodically on the executor processes Make sure catalogs get updated periodically on the executor processes Jul 15, 2025
Copy link
Collaborator

@rohansonecha rohansonecha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @kevinmingtarja! Left one small comment.

@kevinmingtarja
Copy link
Collaborator Author

Thanks for the review @rohansonecha !

@kevinmingtarja kevinmingtarja merged commit d2c6ec4 into master Jul 16, 2025
15 checks passed
@kevinmingtarja kevinmingtarja deleted the fix-executor-stale-catalogs branch July 16, 2025 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants