Scraping Lianjia Second-Hand Housing Listings for Weifang with a Python Crawler

默默无闻の小常, 2023-03-28

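Requirements

The second-hand housing listings shown on the Lianjia pages for each county, city, and district of Weifang need to be scraped according to a fixed set of fields and saved locally.

Workflow design

Identify the target website URL ( https://wf.lianjia.com/ )
Decide which specific second-hand housing details (field names) to scrape
Implement the crawler in Python, built on the requests and lxml libraries
Store the scraped data in a CSV file or a database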
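
The core of the workflow is pulling each listing's detail-page link out of a listing page and skipping entries that have none. The sketch below illustrates that step offline: it swaps in the standard library's xml.etree.ElementTree for lxml and parses a small hand-written snippet instead of a page fetched with requests. The markup, the URLs, and the name detail_links are assumptions made for illustration, not real Lianjia content.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written stand-in for a listing page (not real Lianjia markup)
SAMPLE_LISTING_PAGE = """
<ul class="sellListContent">
  <li><div class="title"><a href="https://example.com/ershoufang/1.html">Listing A</a></div></li>
  <li><div class="title"><a href="https://example.com/ershoufang/2.html">Listing B</a></div></li>
  <li><div class="title"></div></li>
</ul>
"""


def detail_links(markup):
    """Return the detail-page URL of every listing that has one."""
    root = ET.fromstring(markup.strip())
    links = []
    for li in root.findall("li"):
        anchor = li.find(".//div[@class='title']/a")
        # Skip entries without a link instead of indexing into an empty result
        if anchor is not None and anchor.get("href"):
            links.append(anchor.get("href"))
    return links


print(detail_links(SAMPLE_LISTING_PAGE))
```

Checking for a missing link before using it keeps one malformed entry from stopping the whole crawl.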

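Implementation

Project directory

The code lives in two files, Spider_wf.py and lianjia_house.py, inside the secondhouse_spider package.

1. Create the table in the database

I use MySQL 8.0 on my machine, with Navicat as the graphical client.
Database fields:
id: record number, title: listing title, total_price: total price, unit_price: price per unit area,
square: floor area, size: layout, floor: floor, direction: orientation, type: building type,
district: district, nearby: nearby area, community: residential community, elevator: whether there is an elevator,
elevatorNum: elevator-to-household ratio, ownership: property ownership type
The code in the following sections also writes two fields that this list omits: BuildTime (labelled "housing age" in the code comments) and decoration (renovation status).
The accompanying figure shows the field names, data types, and lengths.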
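
Since the table definition is only available as a figure, here is a rough sketch of what a matching CREATE TABLE statement could look like. The column names and the table name wf come from the post's code; every data type and length below is an assumption, and build_create_table is a hypothetical helper.

```python
# Column names follow the post; the types and lengths are assumptions, not the author's schema
COLUMNS = [
    ("id", "INT AUTO_INCREMENT PRIMARY KEY"),
    ("title", "VARCHAR(255)"),
    ("total_price", "VARCHAR(32)"),
    ("unit_price", "VARCHAR(32)"),
    ("square", "VARCHAR(32)"),
    ("size", "VARCHAR(32)"),
    ("floor", "VARCHAR(64)"),
    ("direction", "VARCHAR(32)"),
    ("type", "VARCHAR(64)"),
    ("BuildTime", "VARCHAR(32)"),
    ("district", "VARCHAR(32)"),
    ("nearby", "VARCHAR(64)"),
    ("community", "VARCHAR(64)"),
    ("decoration", "VARCHAR(32)"),
    ("elevator", "VARCHAR(32)"),
    ("elevatorNum", "VARCHAR(32)"),
    ("ownership", "VARCHAR(32)"),
]


def build_create_table(table, columns):
    """Assemble a MySQL CREATE TABLE statement from (name, type) pairs."""
    lines = ",\n".join("  `{}` {}".format(name, sql_type) for name, sql_type in columns)
    return "CREATE TABLE IF NOT EXISTS `{}` (\n{}\n) DEFAULT CHARSET=utf8mb4;".format(table, lines)


ddl = build_create_table("wf", COLUMNS)
print(ddl)
```

The printed statement can be pasted into Navicat's query window or run through a pymysql cursor with cursor.execute(ddl). Storing prices and areas as text mirrors what the crawler scrapes; numeric columns would need the values cleaned first.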

2. Custom data storage functions

This code goes in Spider_wf.py.
The write_csv function saves a record to a CSV file, and the write_db function saves it to the database.


import csv
import pymysql

# Column order shared by the CSV file and the MySQL table
FIELDNAMES = ['title', 'total_price', 'unit_price', 'square', 'size', 'floor', 'direction', 'type',
              'BuildTime', 'district', 'nearby', 'community', 'decoration', 'elevator', 'elevatorNum', 'ownership']


# Append one record to the CSV file
def write_csv(example_1):
    # The with block closes the file after each write, so every row reaches the disk
    with open('二手房数据.csv', mode='a', encoding='utf-8', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        writer.writerow(example_1)


# Insert one record into the wf table
def write_db(example_2):
    conn = pymysql.connect(host='127.0.0.1', port=3306, user='changziru',
                           password='ru123321', database='secondhouse_wf', charset='utf8mb4')
    try:
        with conn.cursor() as cursor:
            # Keys absent from the record default to '' (total_price defaults to '0')
            values = [example_2.get(name, '0' if name == 'total_price' else '') for name in FIELDNAMES]
            columns = ', '.join(FIELDNAMES)
            placeholders = ', '.join(['%s'] * len(FIELDNAMES))
            cursor.execute('insert into wf (' + columns + ') values (' + placeholders + ')', values)
        conn.commit()  # commit the insert
    finally:
        conn.close()  # close the connection even if the insert fails
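
Because write_csv opens the file in append mode and never writes a header, the resulting CSV has no column names. One possible way to add them, sketched below, is to write the header only when the file is new or empty. The helper name append_row, the shortened field list, and the sample rows are assumptions for the demo; the file is written to a temporary directory.

```python
import csv
import os
import tempfile

FIELDNAMES = ["title", "total_price", "district"]  # shortened field list for the demo


def append_row(path, row, fieldnames=FIELDNAMES):
    """Append one record, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode="a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if need_header:
            writer.writeheader()
        writer.writerow(row)


# Made-up rows, written to a throwaway file
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
append_row(demo_path, {"title": "Listing A", "total_price": "85", "district": "Kuiwen"})
append_row(demo_path, {"title": "Listing B", "total_price": "120", "district": "Weicheng"})

with open(demo_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

With a header row in place, tools that read the file later can address columns by name instead of by position.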

3. Crawler implementation

This code goes in lianjia_house.py and calls the write_csv and write_db functions from Spider_wf.py.

# Scrape the detail pages of Lianjia second-hand housing listings
import time
from random import randint
import requests
from lxml import etree
from secondhouse_spider.Spider_wf import write_csv,write_db

# User-Agent strings for imitating a browser
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
# XPath of each field on a listing's detail page
FIELD_XPATHS = {
    'title': '//div[@class="title"]/h1/text()',  # 1 title
    'total_price': '//span[@class="total"]/text()',  # 2 total price
    'unit_price': '//span[@class="unitPriceValue"]/text()',  # 3 unit price
    'square': '//div[@class="area"]/div[@class="mainInfo"]/text()',  # 4 floor area
    'size': '//div[@class="base"]/div[@class="content"]/ul/li[1]/text()',  # 5 layout
    'floor': '//div[@class="base"]/div[@class="content"]/ul/li[2]/text()',  # 6 floor
    'direction': '//div[@class="type"]/div[@class="mainInfo"]/text()',  # 7 orientation
    'type': '//div[@class="area"]/div[@class="subInfo"]/text()',  # 8 building type
    'BuildTime': '//div[@class="transaction"]/div[@class="content"]/ul/li[5]/span[2]/text()',  # 9 housing age
    'district': '//div[@class="areaName"]/span[@class="info"]/a[1]/text()',  # 10 district
    'nearby': '//div[@class="areaName"]/span[@class="info"]/a[2]/text()',  # 11 nearby area
    'community': '//div[@class="communityName"]/a[1]/text()',  # 12 community
    'decoration': '//div[@class="base"]/div[@class="content"]/ul/li[9]/text()',  # 13 decoration
    'elevator': '//div[@class="base"]/div[@class="content"]/ul/li[11]/text()',  # 14 elevator
    'elevatorNum': '//div[@class="base"]/div[@class="content"]/ul/li[10]/text()',  # 15 elevator-to-household ratio
    'ownership': '//div[@class="transaction"]/div[@class="content"]/ul/li[2]/span[2]/text()',  # 16 ownership
}


class SpiderFunc:
    def __init__(self):
        self.count = 0  # number of listings scraped so far

    def fetch(self, url, params=None):
        # Send a randomly chosen User-Agent with every request; return the HTML, or None on failure
        headers = {'User-Agent': USER_AGENTS[randint(0, len(USER_AGENTS) - 1)], 'Connection': 'close'}
        try:
            return requests.get(url=url, params=params, headers=headers, timeout=10).text
        except requests.RequestException as e:
            print(repr(e))  # print the exception
            time.sleep(randint(15, 30))  # wait a random 15 to 30 seconds before moving on
            return None

    def spider(self, url_list):
        for sh in url_list:
            response = self.fetch(sh, params={'param': '1'})
            if response is None:
                continue
            tree = etree.HTML(response)
            li_list = tree.xpath('//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')
            for li in li_list:
                # URL of this listing's detail page
                detail_urls = li.xpath('.//div[@class="title"]/a/@href')
                if not detail_urls:
                    continue
                # Request the detail page
                detail_response = self.fetch(detail_urls[0])
                if detail_response is None:
                    continue
                detail_tree = etree.HTML(detail_response)
                # Take the first match of each XPath, or None when the field is missing
                item = {}
                for field, xpath in FIELD_XPATHS.items():
                    values = detail_tree.xpath(xpath)
                    item[field] = values[0] if values else None
                self.count += 1
                print(self.count, item['title'])

                # Save the record to the CSV file
                write_csv(item)
                # Save the record to the MySQL database
                write_db(item)
# Walk through the listing pages of every district
# Last listing page to request for each district
MAX_PAGES = {
    'qingzhoushi': 40,                # 青州市 (Qingzhou)
    'hantingqu': 76,                  # 寒亭区 (Hanting District)
    'jingjijishukaifaqu2': 85,        # 经济技术开发区 (Economic and Technological Development Zone)
    'shouguangshi': 95,               # 寿光市 (Shouguang)
    'weichengqu': 100,                # 潍城区 (Weicheng District)
    'fangziqu': 100,                  # 坊子区 (Fangzi District)
    'kuiwenqu': 100,                  # 奎文区 (Kuiwen District)
    'gaoxinjishuchanyekaifaqu': 100,  # 高新区 (High-tech Zone)
}

if __name__ == '__main__':
    crawler = SpiderFunc()  # a single instance, so self.count is a running total
    for page in range(1, 101):
        # Request page N only for the districts that have at least N pages
        list_wf = ['https://wf.lianjia.com/ershoufang/' + district + '/pg' + str(page)
                   for district, last_page in MAX_PAGES.items() if page <= last_page]
        crawler.spider(list_wf)
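
When a detail-page request fails, the crawler waits a random 15 to 30 seconds and then skips that listing. A possible variation, sketched here as an assumption rather than as part of the original program, is to retry the same URL a few times before giving up. The names fetch_with_retry and flaky_fetch are hypothetical, and the demo uses a fake fetcher and a fake sleep so that it runs instantly without network access.

```python
import time
from random import randint


def fetch_with_retry(fetch, url, attempts=3, wait=lambda: randint(15, 30), sleep=time.sleep):
    """Call fetch(url); on an exception, pause and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as e:
            print(repr(e))  # report the failure
            if attempt == attempts:
                return None  # give up on this URL
            sleep(wait())  # random pause before the next attempt


# Demo: a fake fetcher that fails twice and then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated failure")
    return "<html>ok</html>"


pauses = []  # records the waits instead of actually sleeping
result = fetch_with_retry(flaky_fetch, "https://example.com/detail/1.html", sleep=pauses.append)
print(result, len(calls), len(pauses))
```

In the crawler itself, the fetch argument could be something like lambda u: requests.get(u, timeout=10).text, with the default sleep left in place.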

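4. Results

The crawl collected 20,826 records in total.
Because the copy in my database is meant for data analysis, I preprocessed it, which left 18,031 records.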
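
The post does not say what the preprocessing involved. Purely as an illustration, the sketch below shows one common kind of cleanup: dropping duplicate records and records with no price or area, and converting those two fields to numbers. The rules, the helper names to_number and preprocess, the sample records, and the assumed shape of the scraped strings (such as '89.5平米') are all assumptions, not the author's actual steps.

```python
import re


def to_number(text):
    """Pull the first number out of a scraped string; return None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return float(match.group()) if match else None


def preprocess(records):
    """Drop duplicates and rows lacking a price or area; convert both fields to floats."""
    seen, cleaned = set(), []
    for rec in records:
        key = (rec.get("title"), rec.get("community"), rec.get("total_price"), rec.get("square"))
        price, area = to_number(rec.get("total_price")), to_number(rec.get("square"))
        if key in seen or price is None or area is None:
            continue
        seen.add(key)
        cleaned.append({**rec, "total_price": price, "square": area})
    return cleaned


# Made-up records for illustration only
sample = [
    {"title": "A", "community": "X", "total_price": "85", "square": "89.5平米"},
    {"title": "A", "community": "X", "total_price": "85", "square": "89.5平米"},  # duplicate
    {"title": "B", "community": "Y", "total_price": None, "square": "120平米"},  # no price
]
print(preprocess(sample))
```

Filters of this kind shrink a dataset, which is consistent with the record count dropping after preprocessing, although the actual criteria may have been different.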
