Which Large Model Is a Godsend for Programmers? ChatGPT Isn't the Best: Copilot Pulls Off the "Giant-Slaying"
We all know that ChatGPT 4's overall capability is clearly stronger than ChatGPT 3.5's.
1. But is the official GPT-3.5 API as capable as the official GPT-3.5 itself?
(Friendly reminder: ChatGPT 3.5 and the ChatGPT 3.5 API are not the same thing.)
2. Which works better: GPT-4.0 on its own, or GPT-4.0 combined with web search?
3. And how does GitHub's Copilot Chat measure up?
With these questions in mind, we put each of them through a single, simple programming problem and see how they perform.
The programming problem:
Write me a complete piece of Python code. I want to crawl 10,000 web pages as fast as possible and store them in a local MySQL database.
ChatGPT3.5 API
import requests
from bs4 import BeautifulSoup
import mysql.connector

# Set up the MySQL database connection
db = mysql.connector.connect(
    host="localhost",
    user="your_username",
    password="your_password",
    database="your_database")
cursor = db.cursor()

# Crawl a page and store it in the database
def crawl_and_store(url):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Parse the page and extract the data to store.
        # This is only an example; adjust the parsing to the actual page structure.
        title = soup.title.text.strip()
        content = soup.find('div', class_='content').text.strip()
        # Insert the data into the database
        query = "INSERT INTO webpages (url, title, content) VALUES (%s, %s, %s)"
        values = (url, title, content)
        cursor.execute(query, values)
        db.commit()
        print(f"Successfully crawled and stored page: {url}")
    except Exception as e:
        print(f"Failed to crawl page: {url}")
        print(f"Error: {str(e)}")

# List of pages to crawl
urls = ["https://example.com/page1", "https://example.com/page2"]  # add more URLs...

# Loop over the list, crawling and storing each page
for url in urls:
    crawl_and_store(url)

# Close the database connection
cursor.close()
db.close()
Summary:
The GPT-3.5 API's code is simple and clear, and there is nothing seriously wrong with it. But it ignored my requirement: I asked for maximum speed, yet it still wrote the code in the most basic, sequential way.
ChatGPT3.5
import requests
import mysql.connector
from concurrent.futures import ThreadPoolExecutor

# Create the database connection
db = mysql.connector.connect(
    host="localhost",
    user="your_username",
    password="your_password",
    database="your_database")
cursor = db.cursor()

# Create the table
cursor.execute("CREATE TABLE IF NOT EXISTS web_pages (id INT AUTO_INCREMENT PRIMARY KEY, url VARCHAR(255), content TEXT)")

def fetch_and_store(url):
    try:
        response = requests.get(url)
        content = response.text
        # Store the page in the database
        sql = "INSERT INTO web_pages (url, content) VALUES (%s, %s)"
        val = (url, content)
        cursor.execute(sql, val)
        db.commit()
        print(f"Successfully crawled and stored page: {url}")
    except Exception as e:
        print(f"Failed to crawl page {url}: {str(e)}")

# List of pages to crawl
urls = ["https://example.com/page1", "https://example.com/page2"]  # add more URLs

# Create a thread pool
executor = ThreadPoolExecutor(max_workers=10)  # choose an appropriate number of threads

# Crawl and store the pages concurrently using the thread pool
for url in urls:
    executor.submit(fetch_and_store, url)

# Wait for all tasks to finish
executor.shutdown()

# Close the database connection
cursor.close()
db.close()
Summary:
The web version of ChatGPT 3.5 does better than the API. Two things stand out: first, it worked out that the table has to be created before inserting into MySQL; second, it noticed that I emphasized an efficient implementation and introduced multithreading into the code.
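One caveat if you reuse this pattern: all worker threads share a single mysql.connector connection and cursor, which Connector/Python does not guarantee to be thread-safe. A minimal sketch of a safer variant, assuming the same placeholder credentials and web_pages table as above, keeps the downloads concurrent but serializes the writes with a lock:

import threading
import requests
import mysql.connector
from concurrent.futures import ThreadPoolExecutor

# Assumes the web_pages table from the code above already exists
db = mysql.connector.connect(host="localhost", user="your_username",
                             password="your_password", database="your_database")
cursor = db.cursor()
db_lock = threading.Lock()  # guards the shared connection and cursor

def fetch_and_store(url):
    try:
        content = requests.get(url, timeout=10).text  # download outside the lock
        with db_lock:  # only the insert/commit is serialized
            cursor.execute("INSERT INTO web_pages (url, content) VALUES (%s, %s)",
                           (url, content))
            db.commit()
    except Exception as e:
        print(f"Failed to crawl page {url}: {e}")

urls = ["https://example.com/page1", "https://example.com/page2"]  # add more URLs
with ThreadPoolExecutor(max_workers=10) as executor:  # waits for all tasks on exit
    for url in urls:
        executor.submit(fetch_and_store, url)

cursor.close()
db.close()

Giving each worker its own connection would work just as well; the lock is simply the smallest change to the generated code.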
ChatGPT4
import concurrent.futures
import requests
import pymysql.cursors

# TODO: replace with the actual list of URLs
url_list = ["http://www.example.com/1", "http://www.example.com/2", "..."]

# Fetch a single page
def fetch_url(url):
    response = requests.get(url)
    return url, response.text

# Connect to MySQL
connection = pymysql.connect(host='localhost',
                             user='your_username',
                             password='your_password',
                             database='your_database',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    # Create a thread pool
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        # Crawl all pages asynchronously with the thread pool
        future_to_url = {executor.submit(fetch_url, url): url for url in url_list}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                print('%r page is %d bytes' % (url, len(data[1])))
                # Store the page in the database
                try:
                    with connection.cursor() as cursor:
                        sql = "INSERT INTO `your_table` (`url`, `content`) VALUES (%s, %s)"
                        cursor.execute(sql, (data[0], data[1]))
                    connection.commit()
                except Exception as error:
                    print('Failed to update database: {}'.format(error))
finally:
    connection.close()
Summary:
ChatGPT 4 is another step up from ChatGPT 3.5. It understood that I wanted high efficiency: it not only used multithreading, it even stressed in a comment that it was "crawling all pages asynchronously with the thread pool". Handling results as they complete lifts the crawler's throughput another notch, and it also anticipated that the table insert might fail and prints out the error message. Very disciplined code.
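If you want to push the efficiency point one step further, the per-page commit is the remaining cost on the database side. A minimal sketch, assuming the same placeholder table and credentials as above, collects the results first and writes them in a single executemany batch:

import concurrent.futures
import requests
import pymysql

def fetch_url(url):
    response = requests.get(url, timeout=10)
    return url, response.text

# TODO: replace with the actual list of URLs
url_list = ["http://www.example.com/1", "http://www.example.com/2"]

# Download concurrently, but collect the rows instead of committing one by one
rows = []
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(fetch_url, url) for url in url_list]
    for future in concurrent.futures.as_completed(futures):
        try:
            rows.append(future.result())
        except Exception as exc:
            print('fetch failed: %s' % exc)

# One batched insert and a single commit instead of one round trip per page
connection = pymysql.connect(host='localhost', user='your_username',
                             password='your_password', database='your_database',
                             charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.executemany(
            "INSERT INTO `your_table` (`url`, `content`) VALUES (%s, %s)", rows)
    connection.commit()
finally:
    connection.close()

Buffering everything in memory is the trade-off; for 10,000 full pages you would likely flush in smaller batches instead.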
iFlytek Spark
import requests
from bs4 import BeautifulSoup
import pymysql

# Connect to the MySQL database
def connect_db():
    db = pymysql.connect(host='localhost', user='root', password='your_password', database='your_database', charset='utf8')
    return db

# Fetch the page content
def get_html(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        return response.text
    except:
        print("Failed to fetch page")
        return None

# Parse the page content
def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Extract the fields you need based on the page structure; adjust to the real pages
    titles = soup.find_all('h1')
    for title in titles:
        yield title.text

# Save the data to the MySQL database
def save_to_db(data):
    db = connect_db()
    cursor = db.cursor()
    sql = "INSERT INTO your_table (title) VALUES (%s)"
    try:
        cursor.executemany(sql, data)
        db.commit()
    except:
        db.rollback()
        print("Failed to save data")
    finally:
        cursor.close()
        db.close()

if __name__ == '__main__':
    urls = ['https://www.example.com/page{}'.format(i) for i in range(1, 10001)]
    data = []
    for url in urls:
        html = get_html(url)
        if html:
            parsed_data = parse_html(html)
            data.extend(parsed_data)
            if len(data) >= 10000:
                save_to_db(data)
                data = []
        else:
            print("Skipping invalid page")
            continue
    if data:
        save_to_db(data)
Summary:
Basically on a par with the ChatGPT 3.5 API: the code is neatly split into functions and batches its inserts with executemany, but the 10,000 pages are still fetched one at a time, so the efficiency requirement is again ignored.
Copilot
import asyncio
import aiohttp
import pymysql
import aiomysql

# MySQL connection settings
DB_SETTINGS = {
    "host": "localhost",
    "database": "your_database",
    "user": "your_username",
    "password": "your_password",
}

# The list of URLs you need to crawl
URLS = ["http://www.example.com"] * 10000

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def save_to_db(pool, url, content):
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            # Write the SQL to match your own table structure
            insert_sql = "INSERT INTO table_name(url, content) VALUES(%s, %s)"
            await cur.execute(insert_sql, (url, content))
            await conn.commit()

async def fetch_and_save(pool, session, url):
    content = await fetch(session, url)
    await save_to_db(pool, url, content)

async def main():
    async with aiohttp.ClientSession() as session:
        pool = await aiomysql.create_pool(**DB_SETTINGS)
        tasks = []
        for url in URLS:
            task = asyncio.create_task(fetch_and_save(pool, session, url))
            tasks.append(task)
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())
Summary:
GitHub's Copilot Chat gave the answer closest to what I had in mind: it uses asyncio and aiohttp, and even aiomysql, keeping the entire pipeline asynchronous and squeezing the most efficiency out of the code. The code is also very elegant. The overall scores are shown in the chart above.
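One practical note if you run this answer as-is: it creates all 10,000 tasks at once, so every request fires at the same time, which can exhaust sockets or trip rate limits. A minimal sketch of a gentler variant, assuming the same placeholder table_name and credentials (aiomysql takes the database name as db), caps concurrency with an asyncio.Semaphore:

import asyncio
import aiohttp
import aiomysql

URLS = ["http://www.example.com"] * 10000
MAX_CONCURRENCY = 100  # tune to what the target site and your machine can tolerate

async def fetch_and_save(pool, session, sem, url):
    async with sem:  # at most MAX_CONCURRENCY downloads in flight
        async with session.get(url) as response:
            content = await response.text()
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute("INSERT INTO table_name(url, content) VALUES(%s, %s)",
                              (url, content))
            await conn.commit()

async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    pool = await aiomysql.create_pool(host="localhost", user="your_username",
                                      password="your_password", db="your_database")
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(fetch_and_save(pool, session, sem, url) for url in URLS),
            return_exceptions=True)  # keep one failed page from cancelling the rest
    pool.close()
    await pool.wait_closed()

if __name__ == "__main__":
    asyncio.run(main())

The aiomysql pool (10 connections by default) throttles the database side in the same way, so the semaphore mainly protects the HTTP side.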