
Allow using the Range header and resume_size at the same time when using the http_get function #2761

Open
narugo1992 opened this issue Jan 18, 2025 · 1 comment

@narugo1992

Is your feature request related to a problem? Please describe.

We (deepghs) developed a Hugging Face tool, hfutils, which has a feature that can download a standalone file from a tar archive without downloading the full tar.

Like this: https://hfutils.deepghs.org/main/api_doc/index/fetch.html#hf-tar-file-download

Its current implementation is based on the HTTP Range header, whose byte window can be inferred from the JSON index file, so we can download files inside tars as if we were downloading them directly by URL.
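To make that concrete, here is a minimal sketch of how a Range header is derived from an index entry. The `offset`/`size` field names match the function below; the concrete byte values are invented for illustration:

```python
# Hypothetical index entry for one file inside the tar (values invented):
info = {"offset": 10240, "size": 4096}

# The member occupies bytes [offset, offset + size - 1] of the tar,
# which maps directly onto an HTTP Range header:
start_bytes = info["offset"]
end_bytes = info["offset"] + info["size"] - 1
range_header = f"bytes={start_bytes}-{end_bytes}"
print(range_header)  # → bytes=10240-14335
```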

def hf_tar_file_download(repo_id: str, archive_in_repo: str, file_in_archive: str, local_file: str,
                         repo_type: RepoTypeTyping = 'dataset', revision: str = 'main',
                         idx_repo_id: Optional[str] = None, idx_file_in_repo: Optional[str] = None,
                         idx_repo_type: Optional[RepoTypeTyping] = None, idx_revision: Optional[str] = None,
                         proxies: Optional[Dict] = None, user_agent: Union[Dict, str, None] = None,
                         headers: Optional[Dict[str, str]] = None, endpoint: Optional[str] = None,
                         force_download: bool = False, silent: bool = False, hf_token: Optional[str] = None):
    """
    Download a specific file from a tar archive stored in a Hugging Face repository.

    This function allows you to extract and download a single file from a tar archive
    that is hosted in a Hugging Face repository. It handles authentication, supports
    different repository types, and can work with separate index repositories.

    :param repo_id: The identifier of the repository containing the tar archive.
    :type repo_id: str
    :param archive_in_repo: The path to the tar archive file within the repository.
    :type archive_in_repo: str
    :param file_in_archive: The path to the desired file inside the tar archive.
    :type file_in_archive: str
    :param local_file: The local path where the downloaded file will be saved.
    :type local_file: str
    :param repo_type: The type of the Hugging Face repository (e.g., 'dataset', 'model', 'space').
    :type repo_type: RepoTypeTyping, optional
    :param revision: The specific revision of the repository to use.
    :type revision: str, optional
    :param idx_repo_id: The identifier of a separate index repository, if applicable.
    :type idx_repo_id: str, optional
    :param idx_file_in_repo: The path to the index file in the index repository.
    :type idx_file_in_repo: str, optional
    :param idx_repo_type: The type of the index repository.
    :type idx_repo_type: RepoTypeTyping, optional
    :param idx_revision: The revision of the index repository.
    :type idx_revision: str, optional
    :param proxies: Proxy settings for the HTTP request.
    :type proxies: Dict, optional
    :param user_agent: Custom user agent for the HTTP request.
    :type user_agent: Union[Dict, str, None], optional
    :param headers: Additional headers for the HTTP request.
    :type headers: Dict[str, str], optional
    :param endpoint: Custom Hugging Face API endpoint.
    :type endpoint: str, optional
    :param force_download: If True, force re-download even if the file exists locally.
    :type force_download: bool
    :param silent: If True, suppress progress bar output.
    :type silent: bool
    :param hf_token: Hugging Face authentication token.
    :type hf_token: str, optional

    :raises FileNotFoundError: If the specified file is not found in the tar archive.
    :raises ArchiveStandaloneFileIncompleteDownload: If the download is incomplete.
    :raises ArchiveStandaloneFileHashNotMatch: If the downloaded file's hash doesn't match the expected hash.

    This function performs several steps:

    1. Retrieves the index of the tar archive.
    2. Checks if the desired file exists in the archive.
    3. Constructs the download URL and headers.
    4. Checks if the file already exists locally and matches the expected size and hash.
    5. Downloads the file if necessary, using byte range requests for efficiency.
    6. Verifies the downloaded file's size and hash.

    Usage examples:
        1. Basic usage:
            >>> hf_tar_file_download(
            ...     repo_id='deepghs/danbooru2024',
            ...     archive_in_repo='images/0000.tar',
            ...     file_in_archive='7506000.jpg',
            ...     local_file='test_example.jpg'  # download destination
            ... )

        2. Using a separate index repository:
            >>> hf_tar_file_download(
            ...     repo_id='nyanko7/danbooru2023',
            ...     idx_repo_id='deepghs/danbooru2023_index',
            ...     archive_in_repo='original/data-0000.tar',
            ...     file_in_archive='1000.png',
            ...     local_file='test_example.png'  # download destination
            ... )

    .. note::

        - This function is particularly useful for efficiently downloading single files from large tar archives
          without having to download the entire archive.
        - It supports authentication via the `hf_token` parameter, which is crucial for accessing private repositories.
        - The function includes checks to avoid unnecessary downloads and to ensure the integrity of the downloaded file.
    """
    index = hf_tar_get_index(
        repo_id=repo_id,
        archive_in_repo=archive_in_repo,
        repo_type=repo_type,
        revision=revision,

        idx_repo_id=idx_repo_id,
        idx_file_in_repo=idx_file_in_repo,
        idx_repo_type=idx_repo_type,
        idx_revision=idx_revision,

        hf_token=hf_token,
    )
    files = _hf_files_process(index['files'])
    if _n_path(file_in_archive) not in files:
        raise FileNotFoundError(f'File {file_in_archive!r} not found '
                                f'in {repo_type}s/{repo_id}@{revision}/{archive_in_repo}.')

    info = files[_n_path(file_in_archive)]

    url_to_download = hf_hub_url(repo_id, archive_in_repo, repo_type=repo_type, revision=revision, endpoint=endpoint)
    headers = build_hf_headers(
        token=hf_token,
        library_name=None,
        library_version=None,
        user_agent=user_agent,
        headers=headers,
    )
    start_bytes = info['offset']
    end_bytes = info['offset'] + info['size'] - 1
    headers['Range'] = f'bytes={start_bytes}-{end_bytes}'

    if not force_download and os.path.exists(local_file) and \
            os.path.isfile(local_file) and os.path.getsize(local_file) == info['size']:
        _expected_sha256 = info.get('sha256')
        if not _expected_sha256 or _f_sha256(local_file) == _expected_sha256:
            # file already ready, no need to download it again
            return

    if os.path.dirname(local_file):
        os.makedirs(os.path.dirname(local_file), exist_ok=True)
    try:
        with open(local_file, 'wb') as f, tqdm(disable=True) as empty_tqdm:
            if info['size'] > 0:
                http_get(
                    url_to_download,
                    f,
                    proxies=proxies,
                    resume_size=0,
                    headers=headers,
                    expected_size=info['size'],
                    displayed_filename=file_in_archive,
                    _tqdm_bar=empty_tqdm if silent else None,
                )

        if os.path.getsize(local_file) != info['size']:
            raise ArchiveStandaloneFileIncompleteDownload(
                f'Expected size is {info["size"]}, but actually {os.path.getsize(local_file)} downloaded.'
            )

        if info.get('sha256'):
            _sha256 = _f_sha256(local_file)
            if _sha256 != info['sha256']:
                raise ArchiveStandaloneFileHashNotMatch(
                    f'Expected hash is {info["sha256"]!r}, but actually {_sha256!r} found.'
                )

    except Exception:
        if os.path.exists(local_file):
            os.remove(local_file)
        raise

It currently uses the http_get function, with expected_size, headers, and resume_size (always 0) provided.

I found that when http_get tries to resume downloading, it will attempt to download the full tar on the next attempt, because of the following code in huggingface_hub:

    initial_headers = headers
    headers = copy.deepcopy(headers) or {}
    if resume_size > 0:
        headers["Range"] = "bytes=%d-" % (resume_size,)

and

        new_resume_size = resume_size
        try:
            for chunk in r.iter_content(chunk_size=constants.DOWNLOAD_CHUNK_SIZE):
                if chunk:  # filter out keep-alive new chunks
                    progress.update(len(chunk))
                    temp_file.write(chunk)
                    new_resume_size += len(chunk)
                    # Some data has been downloaded from the server so we reset the number of retries.
                    _nb_retries = 5

which means that on the next resume attempt, the original Range header is overwritten by one derived from resume_size alone.
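A minimal reproduction of that overwrite, with made-up byte values:

```python
# A caller asks for one member of the tar (values invented for illustration):
headers = {"Range": "bytes=1000-1999"}
resume_size = 500  # bytes already written to disk before the retry

# Simplified version of the current http_get behavior:
if resume_size > 0:
    headers["Range"] = "bytes=%d-" % (resume_size,)

# The original window is gone: the server is now asked for bytes 500
# through the end of the *entire tar*, not bytes 1500-1999 of the member.
print(headers["Range"])  # → bytes=500-
```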

So I suggest letting resume_size and the Range header work together: if you provide Range: bytes=x-y together with resume_size t, the header should be rewritten to Range: bytes=(x+t)-y, not simply Range: bytes=t-.

I can create a PR for this; my plan is to change the resume_size-overwriting code so that the two work together. Any suggestions?

@hanouticelina
Contributor

Hi @narugo1992, thank you!
Supporting this use case makes sense. Indeed, http_get was initially designed to download complete files, which explains why the current behavior simply overwrites the Range header with bytes=resume_size-.
The fix seems relatively straightforward to implement, maybe with something like this:

# in http_get
if resume_size > 0:
    range_value = headers.get("Range", "")
    if range_value.startswith("bytes=") and "-" in range_value[6:]:
        start, end = range_value[6:].split("-", 1)
        headers["Range"] = f"bytes={int(start) + resume_size}-{end}"
    else:
        headers["Range"] = f"bytes={resume_size}-"
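For illustration, here is the same logic wrapped in a standalone helper (the helper name and its exact placement inside http_get are my assumption, not part of the proposal), together with the two cases it handles:

```python
# Hypothetical helper sketching the proposed Range/resume_size merge:
def _adjust_range_header(headers: dict, resume_size: int) -> dict:
    headers = dict(headers)  # don't mutate the caller's dict
    if resume_size > 0:
        range_value = headers.get("Range", "")
        if range_value.startswith("bytes=") and "-" in range_value[6:]:
            # Bounded range: shift the start forward, keep the end bound.
            start, end = range_value[6:].split("-", 1)
            headers["Range"] = f"bytes={int(start) + resume_size}-{end}"
        else:
            # No caller-supplied range: current behavior is unchanged.
            headers["Range"] = f"bytes={resume_size}-"
    return headers

print(_adjust_range_header({"Range": "bytes=1000-1999"}, 500))  # → {'Range': 'bytes=1500-1999'}
print(_adjust_range_header({}, 500))                            # → {'Range': 'bytes=500-'}
```

Copying the dict before mutating also sidesteps the `initial_headers` bookkeeping that http_get currently does for retries.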

It would be awesome if you create a PR for this! 🤗
