Introduce data loading utility for reading from local cache or downloading from external URL #3282

Closed

@ltiao ltiao commented Jan 29, 2025

Summary:

Context

Our preprocessed and compressed derivatives of open-source benchmarking datasets (e.g., LCBench) are currently hosted in Manifold blob storage, which limits their accessibility in our open-source software (OSS). To address this, we need to remove the dependency on Manifold.

Changes

This diff introduces a data loading utility that reads Pandas DataFrames (stored in a compressed parquet format) from a cache on local disk, downloading them from an external URL if they are not found locally. The key changes include:

  • Introduced AbstractParquetDataLoader class, providing a way to load parquet data from a cache on local disk or download from an external URL.
  • Implemented methods for:
    • Getting the cache path
    • Checking if the data is cached
    • Reading the data from the cache
    • Downloading from an external URL and caching the data
  • Added abstract properties for getting the directory name and URL of the cached file, allowing easy specialization for other benchmark datasets.
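For readers without access to the diff, the cache-or-download pattern described in the list above can be sketched roughly as follows. This is a minimal illustration and not the `AbstractParquetDataLoader` added in this PR: the class name `ParquetLoaderSketch`, the `cache_root` argument, and the `dir_name`, `url`, `cache_path`, `is_cached`, `download`, and `load` names are all assumptions made for the example.

```python
from __future__ import annotations

import urllib.request
from abc import ABC, abstractmethod
from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pandas as pd


class ParquetLoaderSketch(ABC):
    """Hypothetical cache-or-download loader (not the class from this PR)."""

    def __init__(self, cache_root: Path) -> None:
        # Assumed: the caller chooses where cached files live.
        self.cache_root = cache_root

    @property
    @abstractmethod
    def dir_name(self) -> str:
        """Subdirectory of the cache root that holds this dataset."""

    @property
    @abstractmethod
    def url(self) -> str:
        """External location of the compressed parquet file."""

    @property
    def cache_path(self) -> Path:
        # The cached file keeps the last path segment of the URL as its name.
        return self.cache_root / self.dir_name / self.url.rsplit("/", 1)[-1]

    def is_cached(self) -> bool:
        return self.cache_path.is_file()

    def download(self) -> None:
        # Fetch the raw bytes and store them under the cache path.
        self.cache_path.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(self.url) as response:
            self.cache_path.write_bytes(response.read())

    def load(self) -> pd.DataFrame:
        # pandas is imported here so the caching logic works without it.
        import pandas as pd

        if not self.is_cached():
            self.download()
        return pd.read_parquet(self.cache_path)
```

Under these assumptions, supporting another benchmark dataset only requires a subclass that overrides `dir_name` and `url`; the caching logic is inherited unchanged.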

With these changes, we can now make our LCBench surrogate benchmark problems accessible in OSS and move them from ax.fb to ax.

Differential Revision: D68790695

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Jan 29, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D68790695

ltiao added a commit to ltiao/Ax that referenced this pull request Jan 29, 2025

ltiao added a commit to ltiao/Ax that referenced this pull request Jan 30, 2025
…ading from external URL (facebook#3282)

Summary:
Pull Request resolved: facebook#3282

## Context

Our preprocessed and compressed derivatives of open-source benchmarking datasets (e.g., LCBench) are currently hosted in Manifold blob storage, which limits their accessibility in our open-source software (OSS). To address this, we need to remove the dependency on Manifold.

## Changes
This diff introduces a data loading utility that reads Pandas DataFrames (stored in a compressed parquet format) from a cache on local disk, downloading them from an external URL if they are not found locally. The key changes include:
- Introduced AbstractParquetDataLoader class, providing a way to load parquet data from a cache on local disk or download from an external URL.
- Implemented methods for:
  * Getting the cache path
  * Checking if the data is cached
  * Reading the data from the cache
  * Downloading from an external URL and caching the data
- Added abstract properties for getting the directory name and URL of the cached file, allowing easy specialization for other benchmark datasets.

With these changes, we can now make our LCBench surrogate benchmark problems accessible in OSS and move them from `ax.fb` to `ax`.

## WIP/TODO

1. Add new unit tests
2. Address OSS coverage requirements

Differential Revision: D68790695

ltiao added a commit to ltiao/Ax that referenced this pull request Jan 30, 2025

ltiao added a commit to ltiao/Ax that referenced this pull request Jan 30, 2025

ltiao added a commit to ltiao/Ax that referenced this pull request Jan 30, 2025

codecov-commenter commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 45.42484% with 167 lines in your changes missing coverage. Please review.

Project coverage is 95.74%. Comparing base (257af9c) to head (61be9d1).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...hmark/problems/surrogate/lcbench/early_stopping.py | 0.00% | 81 Missing ⚠️ |
| ax/benchmark/problems/surrogate/lcbench/data.py | 0.00% | 74 Missing ⚠️ |
| ...rk/problems/surrogate/lcbench/transfer_learning.py | 88.63% | 5 Missing ⚠️ |
| ax/benchmark/problems/surrogate/lcbench/utils.py | 68.75% | 5 Missing ⚠️ |
| ax/benchmark/problems/data.py | 94.44% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3282      +/-   ##
==========================================
- Coverage   96.04%   95.74%   -0.30%     
==========================================
  Files         518      525       +7     
  Lines       52162    52468     +306     
==========================================
+ Hits        50098    50238     +140     
- Misses       2064     2230     +166     


ltiao added a commit to ltiao/Ax that referenced this pull request Jan 31, 2025

ltiao added a commit to ltiao/Ax that referenced this pull request Feb 4, 2025

Reviewed By: esantorella

Differential Revision: D68790695

@facebook-github-bot

This pull request has been merged in 38916d1.

Labels
CLA Signed Do not delete this pull request or issue due to inactivity. fb-exported Merged