Scraping a Jokes Website (Python Crawler)

1. Qiushibaike

From Baidu Baike: Qiushibaike is a joke website built around users' real embarrassing stories. Its topics are light and casual, and it is very popular among young people. Sharing embarrassing moments online is fashionable among white-collar workers and students, and everything on the site is written by users about their own embarrassing or unlucky experiences. Things that one could never bring up in real life, such as the story about a big hotel recruiting male attendants, can be posted there quite naturally.

2. Complete Code

The crawler below walks the text-jokes list pages, collects the link to each individual joke, then visits each joke page, extracts the title and body with XPath, and appends them to a local text file.
import requests
from lxml import etree

# Request headers: send a browser User-Agent so the site does not reject the request
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'
}

# Fetch a page and return its HTML as text
def get_text(url):
    response = requests.get(url, headers=headers).text
    return response

# Parse a single joke page: extract the title and body, print them, and append them to a file
def next_page_parse(url):
    global count
    response = requests.get(url, headers=headers).text
    soup = etree.HTML(response)
    title = soup.xpath('//*[@id="content"]/div/div[2]/h1/text()')
    content = soup.xpath('//*[@id="single-next-link"]/div/text()')
    if not title:  # skip pages that do not match the expected layout
        return
    res = ''.join(content)
    print("Crawling joke #%d…" % count)
    print(url)
    print(title[0].rstrip())
    print(res)
    with open('糗事百科.txt', 'a', encoding='utf-8') as f:  # output file
        f.write(title[0] + res + '\n')
    print("Joke #%d saved!\n" % count)
    count += 1

# Collect the links to the individual joke pages from one list page
def get_urls(html):
    soup = etree.HTML(html)
    tags = soup.xpath('//*[@id="content"]/div/div[2]')
    urls = []
    for tag in tags:
        urls.extend(tag.xpath('./div/a[1]/@href'))
    return urls

# Main: walk list pages 1-13 and crawl every joke linked from each of them
if __name__ == '__main__':
    count = 1  # running count of jokes crawled
    urls = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1, 14)]
    for url in urls:
        text = get_text(url)
        for link in get_urls(text):
            next_page_parse('https://www.qiushibaike.com' + link)
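
To sanity-check the two XPath expressions without touching the network, you can run the same extraction against a hand-written snippet. The HTML below is invented for illustration: it only mimics the DOM shape the expressions above expect (a second div under #content holding the h1, and a div under #single-next-link holding the body text). The real page structure may differ, and these selectors can break whenever Qiushibaike changes its markup.

from lxml import etree

# Hand-written HTML (hypothetical) mimicking the structure the crawler's XPath expects
html = '''
<div id="content">
  <div>
    <div>sidebar placeholder</div>
    <div><h1>Sample joke title</h1></div>
  </div>
</div>
<div id="single-next-link"><div>Line one of the joke. Line two.</div></div>
'''

soup = etree.HTML(html)
title = soup.xpath('//*[@id="content"]/div/div[2]/h1/text()')
content = soup.xpath('//*[@id="single-next-link"]/div/text()')

print(title[0])          # Sample joke title
print(''.join(content))  # Line one of the joke. Line two.

If you run the real crawler, it is also worth adding a short time.sleep() between requests so the site is not hit with back-to-back page loads.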
