Python: downloading a Hugging Face model through a proxy server and running a pipeline


We want to download a model at runtime from Python code. After starting the proxy client:

1. Check that the proxy can reach the outside internet

The socks5h scheme resolves DNS through the proxy as well (plain socks5 resolves hostnames locally):

$ curl -x socks5h://127.0.0.1:1080 https://www.example.com
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

2. Verify with Python code

test_proxy.py:

import requests

url1 = 'https://www.example.com'
url2 = 'https://huggingface.co/'  # swap in for url1 to test reaching Hugging Face itself
proxies = {
    'http': 'socks5h://localhost:1080',
    'https': 'socks5h://localhost:1080'
}

try:
    response = requests.get(url1, proxies=proxies)
    if response.status_code == 200:
        print("Successfully connected through the proxy and fetched data!")
        print("Response body:", response.text)
    else:
        print("Failed to connect through the proxy. Check the proxy settings and network connection.")
except requests.exceptions.RequestException as e:
    print("The request raised an exception:", str(e))

Output:

Successfully connected through the proxy and fetched data!
Response body: <!doctype html>
... (the same Example Domain page shown in step 1) ...


Process finished with exit code 0

The connection works.

3. Run the model download code

download_model.py:

import os
import json
import requests
from uuid import uuid4
from tqdm import tqdm

proxies = {
    'http': 'socks5h://localhost:1080',
    'https': 'socks5h://localhost:1080'
}

# uuid4() generates a unique session ID, sent in the request headers to identify this session
SESSIONID = uuid4().hex

VOCAB_FILE = "vocab.txt"
CONFIG_FILE = "config.json"
MODEL_FILE = "pytorch_model.bin"
BASE_URL = "https://huggingface.co/{}/resolve/main/{}"
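# e.g. BASE_URL.format(model_id, VOCAB_FILE) expands to
# https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/resolve/main/vocab.txt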

# Mimic the user-agent string that transformers itself sends when downloading
headers = {'user-agent': 'transformers/4.38.2; python/3.11.8; '
                         'session_id/{}; torch/2.2.1; tensorflow/2.15.0; '
                         'file_type/model; framework/pytorch; from_auto_class/False'.format(SESSIONID)}

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Create a directory for this model

model_dir = model_id.replace("/", "-")

print(model_dir)

if not os.path.exists(model_dir):
	os.mkdir(model_dir)


# The vocab and config files can be downloaded directly:
# send GET requests to the Hugging Face resolve URLs
r = requests.get(BASE_URL.format(model_id, VOCAB_FILE), headers=headers, proxies=proxies)
r.encoding = "utf-8"
with open(os.path.join(model_dir, VOCAB_FILE), "w", encoding="utf-8") as f:
	f.write(r.text)
	print("Vocab file for {} downloaded!".format(model_id))

r = requests.get(BASE_URL.format(model_id, CONFIG_FILE), headers=headers, proxies=proxies)
r.encoding = "utf-8"
with open(os.path.join(model_dir, CONFIG_FILE), "w", encoding="utf-8") as f:
	json.dump(r.json(), f, indent="\t")
	print("Config file for {} downloaded!".format(model_id))


# Downloading the model file takes two steps

# Step 1: get the real download URL.
# requests.head() does not follow redirects by default, so a 3xx response
# carries the CDN address in its Location header; fall back to the resolve
# URL itself if the server does not redirect.
url_to_download = BASE_URL.format(model_id, MODEL_FILE)
r = requests.head(url_to_download, headers=headers, proxies=proxies)
r.raise_for_status()
if 300 <= r.status_code <= 399:
	url_to_download = r.headers["Location"]

# Step 2: fetch the model from the real URL.
# stream=True downloads the body in chunks instead of loading it all into memory.
# headers=None: the custom user-agent header is only sent to huggingface.co, not to the CDN.
r = requests.get(url_to_download, stream=True, headers=None, proxies=proxies)
r.raise_for_status()

# The progress bar is optional; this reuses the approach from the transformers package.
# "Content-Length" in the response headers is the total file size in bytes.
content_length = r.headers.get("Content-Length")
total = int(content_length) if content_length is not None else None
"""
unit="B" makes the bar count bytes; unit_scale=True auto-scales the display
(KB, MB, GB, ...). total sets the bar's total size, initial=0 starts it at
zero, and desc="Downloading Model" is the label shown before the bar."""
progress = tqdm(
	unit="B",
	unit_scale=True,
	total=total,
	initial=0,
	desc="Downloading Model",
)
"""
使用iter_content()方法以指定的块大小(这里是1024字节)迭代下载的内容。
每次迭代,将一个块的内容存储在chunk变量中。
在每个块的迭代过程中,首先通过条件if chunk过滤掉空的块,以排除保持连接的新块。"""
with open(os.path.join(model_dir, MODEL_FILE), "wb") as temp_file:
	for chunk in r.iter_content(chunk_size=1024):
		if chunk:  # filter out keep-alive new chunks
			progress.update(len(chunk))
			temp_file.write(chunk)
progress.close()

print("{}模型文件下载完毕!".format(model_id))

The download speed is quite good.
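With the three files in place, transformers can load the model from the local directory instead of the Hub. A minimal sketch (hypothetical usage, assuming vocab.txt, config.json and pytorch_model.bin are all this model needs):

from transformers import pipeline

# "./" forces transformers to treat this as a local path, not a Hub model ID
clf = pipeline("text-classification", model="./distilbert-base-uncased-finetuned-sst-2-english")
print(clf("Hello, world!"))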
If you now run the pipeline code:

text_classification = pipeline("text-classification")

you will see (the hf-mirror.com URL in the log suggests this environment points HF_ENDPOINT at the mirror):

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://hf-mirror.com/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

The warning names the default model, so make sure the model_id in the download script above matches it:

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

4. Run the pipeline code

pipeline.py:

from transformers import pipeline
import urllib.request


print(urllib.request.getproxies())

text_classification = pipeline("text-classification")
result = text_classification("Hello, world!")
print(result)

It fails with an error:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/wxf/PycharmProjects/llm/pipe.py", line 21, in <module>
    text_classification = pipeline("text-classification")
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wxf/lib/anaconda/envs/transformers/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 879, in pipeline
    config = AutoConfig.from_pretrained(model, _from_pipeline=task, **hub_kwargs, **model_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wxf/lib/anaconda/envs/transformers/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1111, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wxf/lib/anaconda/envs/transformers/lib/python3.11/site-packages/transformers/configuration_utils.py", line 633, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wxf/lib/anaconda/envs/transformers/lib/python3.11/site-packages/transformers/configuration_utils.py", line 688, in _get_config_dict
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/home/wxf/lib/anaconda/envs/transformers/lib/python3.11/site-packages/transformers/utils/hub.py", line 441, in cached_file
    raise EnvironmentError(
OSError: We couldn't connect to 'https://hf-mirror.com' to load this file, couldn't find it in the cached files and it looks like distilbert/distilbert-base-uncased-finetuned-sst-2-english is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
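Before patching library internals, it is worth trying the standard proxy environment variables, which requests (and hence transformers) honors. A minimal sketch, assuming the same local SOCKS proxy; this is not what the post itself does:

import os

# Set before transformers makes any network request
os.environ["HTTP_PROXY"] = "socks5h://localhost:1080"
os.environ["HTTPS_PROXY"] = "socks5h://localhost:1080"

from transformers import pipeline

text_classification = pipeline("text-classification")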

Alternatively, edit /home/wxf/lib/anaconda/envs/transformers/lib/python3.11/site-packages/transformers/configuration_utils.py and change the cached_file() call to:

resolved_config_file = cached_file(
                    pretrained_model_name_or_path,
                    configuration_file,
                    cache_dir=cache_dir,
                    force_download=force_download,
                    proxies={
                    'http': 'socks5h://localhost:1080',
                    'https': 'socks5h://localhost:1080'
                    },
                    resume_download=resume_download,
                    local_files_only=local_files_only,
                    token=token,
                    user_agent=user_agent,
                    revision=revision,
                    subfolder=subfolder,
                    _commit_hash=commit_hash,
                )

Then the run succeeds:

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://hf-mirror.com/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9997164607048035}]
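To silence the remaining warning, pin the model and revision that the log itself recommends:

text_classification = pipeline(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    revision="af0f99b",
)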

5. References

  1. How to download huggingface transformers pretrained models locally and use them?
  2. High-speed HuggingFace downloads for users in mainland China
  3. How to set a download proxy for Huggingface's from_pretrained

