Is your feature request related to a problem? Please describe.
We (deepghs) developed a huggingface tool, hfutils, which has a feature that can download a standalone file from a tar without downloading the full tar.
Like this: https://hfutils.deepghs.org/main/api_doc/index/fetch.html#hf-tar-file-download
Its implementation is based on the HTTP Range header, whose byte range can be inferred from the JSON index file, so we can download files inside tars just like downloading them by URL.
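Conceptually, given a member's byte offset and size from the index, a single ranged GET is enough. Here is a minimal sketch of the idea using requests (the offset/size values and the output filename are made up for illustration); the actual implementation, hf_tar_file_download, follows:
import requests

# Hypothetical index entry for one member of the tar, as recorded in the JSON index file:
info = {'offset': 1536, 'size': 102400}  # byte offset inside the tar, and file size in bytes

url = 'https://huggingface.co/datasets/deepghs/danbooru2024/resolve/main/images/0000.tar'
start = info['offset']
end = info['offset'] + info['size'] - 1

# Request only the member's bytes instead of the whole tar.
resp = requests.get(url, headers={'Range': f'bytes={start}-{end}'}, stream=True)
resp.raise_for_status()  # expect 206 Partial Content

with open('7506000.jpg', 'wb') as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        f.write(chunk)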
def hf_tar_file_download(repo_id: str, archive_in_repo: str, file_in_archive: str, local_file: str,
                         repo_type: RepoTypeTyping = 'dataset', revision: str = 'main',
                         idx_repo_id: Optional[str] = None, idx_file_in_repo: Optional[str] = None,
                         idx_repo_type: Optional[RepoTypeTyping] = None, idx_revision: Optional[str] = None,
                         proxies: Optional[Dict] = None, user_agent: Union[Dict, str, None] = None,
                         headers: Optional[Dict[str, str]] = None, endpoint: Optional[str] = None,
                         force_download: bool = False, silent: bool = False, hf_token: Optional[str] = None):
    """
    Download a specific file from a tar archive stored in a Hugging Face repository.

    This function allows you to extract and download a single file from a tar archive that is
    hosted in a Hugging Face repository. It handles authentication, supports different repository
    types, and can work with separate index repositories.

    :param repo_id: The identifier of the repository containing the tar archive.
    :type repo_id: str
    :param archive_in_repo: The path to the tar archive file within the repository.
    :type archive_in_repo: str
    :param file_in_archive: The path to the desired file inside the tar archive.
    :type file_in_archive: str
    :param local_file: The local path where the downloaded file will be saved.
    :type local_file: str
    :param repo_type: The type of the Hugging Face repository (e.g., 'dataset', 'model', 'space').
    :type repo_type: RepoTypeTyping, optional
    :param revision: The specific revision of the repository to use.
    :type revision: str, optional
    :param idx_repo_id: The identifier of a separate index repository, if applicable.
    :type idx_repo_id: str, optional
    :param idx_file_in_repo: The path to the index file in the index repository.
    :type idx_file_in_repo: str, optional
    :param idx_repo_type: The type of the index repository.
    :type idx_repo_type: RepoTypeTyping, optional
    :param idx_revision: The revision of the index repository.
    :type idx_revision: str, optional
    :param proxies: Proxy settings for the HTTP request.
    :type proxies: Dict, optional
    :param user_agent: Custom user agent for the HTTP request.
    :type user_agent: Union[Dict, str, None], optional
    :param headers: Additional headers for the HTTP request.
    :type headers: Dict[str, str], optional
    :param endpoint: Custom Hugging Face API endpoint.
    :type endpoint: str, optional
    :param force_download: If True, force re-download even if the file exists locally.
    :type force_download: bool
    :param silent: If True, suppress progress bar output.
    :type silent: bool
    :param hf_token: Hugging Face authentication token.
    :type hf_token: str, optional

    :raises FileNotFoundError: If the specified file is not found in the tar archive.
    :raises ArchiveStandaloneFileIncompleteDownload: If the download is incomplete.
    :raises ArchiveStandaloneFileHashNotMatch: If the downloaded file's hash doesn't match the expected hash.

    This function performs several steps:

    1. Retrieves the index of the tar archive.
    2. Checks if the desired file exists in the archive.
    3. Constructs the download URL and headers.
    4. Checks if the file already exists locally and matches the expected size and hash.
    5. Downloads the file if necessary, using byte range requests for efficiency.
    6. Verifies the downloaded file's size and hash.

    Usage examples:

    1. Basic usage:

        >>> hf_tar_file_download(
        ...     repo_id='deepghs/danbooru2024',
        ...     archive_in_repo='images/0000.tar',
        ...     file_in_archive='7506000.jpg',
        ...     local_file='test_example.jpg'  # download destination
        ... )

    2. Using a separate index repository:

        >>> hf_tar_file_download(
        ...     repo_id='nyanko7/danbooru2023',
        ...     idx_repo_id='deepghs/danbooru2023_index',
        ...     archive_in_repo='original/data-0000.tar',
        ...     file_in_archive='1000.png',
        ...     local_file='test_example.png'  # download destination
        ... )

    .. note::
        - This function is particularly useful for efficiently downloading single files from large
          tar archives without having to download the entire archive.
        - It supports authentication via the `hf_token` parameter, which is crucial for accessing
          private repositories.
        - The function includes checks to avoid unnecessary downloads and to ensure the integrity
          of the downloaded file.
    """
    index = hf_tar_get_index(
        repo_id=repo_id,
        archive_in_repo=archive_in_repo,
        repo_type=repo_type,
        revision=revision,
        idx_repo_id=idx_repo_id,
        idx_file_in_repo=idx_file_in_repo,
        idx_repo_type=idx_repo_type,
        idx_revision=idx_revision,
        hf_token=hf_token,
    )
    files = _hf_files_process(index['files'])
    if _n_path(file_in_archive) not in files:
        raise FileNotFoundError(f'File {file_in_archive!r} not found '
                                f'in {repo_type}s/{repo_id}@{revision}/{archive_in_repo}.')

    info = files[_n_path(file_in_archive)]
    url_to_download = hf_hub_url(repo_id, archive_in_repo, repo_type=repo_type, revision=revision, endpoint=endpoint)
    headers = build_hf_headers(
        token=hf_token,
        library_name=None,
        library_version=None,
        user_agent=user_agent,
        headers=headers,
    )
    start_bytes = info['offset']
    end_bytes = info['offset'] + info['size'] - 1
    headers['Range'] = f'bytes={start_bytes}-{end_bytes}'

    if not force_download and os.path.exists(local_file) and \
            os.path.isfile(local_file) and os.path.getsize(local_file) == info['size']:
        _expected_sha256 = info.get('sha256')
        if not _expected_sha256 or _f_sha256(local_file) == _expected_sha256:
            # file already ready, no need to download it again
            return

    if os.path.dirname(local_file):
        os.makedirs(os.path.dirname(local_file), exist_ok=True)
    try:
        with open(local_file, 'wb') as f, tqdm(disable=True) as empty_tqdm:
            if info['size'] > 0:
                http_get(
                    url_to_download,
                    f,
                    proxies=proxies,
                    resume_size=0,
                    headers=headers,
                    expected_size=info['size'],
                    displayed_filename=file_in_archive,
                    _tqdm_bar=empty_tqdm if silent else None,
                )

        if os.path.getsize(local_file) != info['size']:
            raise ArchiveStandaloneFileIncompleteDownload(
                f'Expected size is {info["size"]}, but actually {os.path.getsize(local_file)} downloaded.'
            )

        if info.get('sha256'):
            _sha256 = _f_sha256(local_file)
            if _sha256 != info['sha256']:
                raise ArchiveStandaloneFileHashNotMatch(
                    f'Expected hash is {info["sha256"]!r}, but actually {_sha256!r} found.'
                )
    except Exception:
        if os.path.exists(local_file):
            os.remove(local_file)
        raise
It currently uses the http_get function, with expected_size, headers, and resume_size (always 0) provided.
I found that when http_get tries to resume a download, it attempts to download the full tar on the next try, because of the following code from huggingface_hub:
initial_headers = headers
headers = copy.deepcopy(headers) or {}
if resume_size > 0:
    headers["Range"] = "bytes=%d-" % (resume_size,)
and
new_resume_size = resume_size
try:
    for chunk in r.iter_content(chunk_size=constants.DOWNLOAD_CHUNK_SIZE):
        if chunk:  # filter out keep-alive new chunks
            progress.update(len(chunk))
            temp_file.write(chunk)
            new_resume_size += len(chunk)
            # Some data has been downloaded from the server so we reset the number of retries.
            _nb_retries = 5
This means that on the next resume attempt, the original Range header is overwritten based on resume_size.
So I suggest letting resume_size and the headers work together: if you provide Range: bytes=x-y together with a resume_size of t, the header should be rewritten to Range: bytes=(x+t)-y, not simply Range: bytes=t-.
I can create a PR for this; my plan is to change the resume_size-overwriting code so the two work together. Any suggestions?
Hi @narugo1992, thank you!
Supporting this use case makes sense. Indeed, http_get was initially designed to download complete files, which explains why the current behavior simply overwrites the Range header with bytes=resume_size-.
The solution seems relatively straightforward to implement, maybe with something like this:
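As a minimal sketch of one way to do it (the _adjust_range_header helper below is hypothetical, not the final huggingface_hub implementation): shift an existing bytes=x-y range forward by resume_size instead of discarding it.
import re
from typing import Optional


def _adjust_range_header(original_range: Optional[str], resume_size: int) -> str:
    # Shift an existing 'bytes=x-y' (or open-ended 'bytes=x-') range forward by
    # resume_size bytes, keeping the original upper bound if one was given.
    if original_range:
        match = re.fullmatch(r'bytes=(\d+)-(\d*)', original_range)
        if match:
            start = int(match.group(1)) + resume_size
            end = match.group(2)
            return f'bytes={start}-{end}'
    # No usable range provided: fall back to the current behavior.
    return f'bytes={resume_size}-'


# In http_get's header setup, instead of unconditionally overwriting the header:
# if resume_size > 0:
#     headers['Range'] = _adjust_range_header(headers.get('Range'), resume_size)
With this, _adjust_range_header('bytes=1000-2999', 500) returns 'bytes=1500-2999', matching the proposed bytes=(x+t)-y behavior, while callers that pass no Range header keep the current bytes=t- behavior.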