散文網(wǎng) » 生活 »日常 » 畢業(yè)設(shè)計大數(shù)據(jù)房價預(yù)測分析與可視化系統(tǒng)

Graduation Project: Big Data House Price Prediction, Analysis, and Visualization System

2023-03-10 10:15 Author: 丹成學長 (Dancheng Xuezhang)

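0 Preface

Over the past two years, the requirements and difficulty of graduation projects and thesis defenses have kept rising. Traditional project topics lack novelty and standout features and often fall short of what the defense requires; many junior students have told me that the systems they built did not meet their advisors' expectations.

To help everyone get through the graduation project smoothly and with as little effort as possible, I am sharing high-quality graduation project ideas. Today's project is:

Big data house price prediction, analysis, and visualization

Here is my overall rating of the topic (each item is scored out of 5):

Difficulty: 3
Workload: 3
Novelty: 3

For graduation project help, topic selection advice, and technical Q&A, feel free to reach out via my Bilibili profile:

https://space.bilibili.com/33886978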

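1 Background

The Ames data set contains 2930 records from the Ames Assessor's Office. It has 23 nominal variables, 23 ordinal variables, 14 discrete variables, and 20 continuous variables (plus 2 additional observation identifiers), for a total of 82 features. A description of each variable can be found in the included codebook.txt file. The information was used to compute assessed values for individual residential properties sold in Ames, Iowa, from 2006 to 2010. Some noise has been added to the actual sale prices, so the prices do not match official records.

The data are split into a training set and a test set with 2000 and 930 observations, respectively. The actual sale price is withheld in the test set. In addition, the test data are further split into public and private test sets.

This exercise is organized around the following goals:

Understand the problem: look at what each variable means and how much it matters to the problem
Study the main feature: that is, the target variable, the house price
Study the other variables: examine how the other variables affect the house price and how they relate to one another
Basic data cleaning: handle missing data, outliers, and categorical data
Fit a model: build a model that predicts house values and predicts prices accurately

2 Importing the data

1. Import the required Python packages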

import numpy as np

import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import linear_model as lm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

2. Load the training and test data sets

training_data = pd.read_csv("ames_train.csv")
test_data = pd.read_csv("ames_test.csv")
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Set the display width of a value to 100 characters (the default is 50)
pd.set_option('display.max_colwidth', 100)
training_data.head(7)

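3 Relationship between the main features and the sale price

The data set has 46 categorical variables and 34 numeric variables. They were collected in an Excel spreadsheet to screen for the variables most closely tied to house price. The following variables were selected:

Categorical variables:

Utilities (Ordinal): utilities available (electricity, gas, water)
Heating (Nominal): type of heating
Central Air (Nominal): whether the house has central air conditioning
Garage Type (Nominal): garage location
Neighborhood (Nominal): physical location within the Ames city limits (map location)
Overall Qual (Ordinal): rates the overall material and finish of the house

Numeric variables:

Lot Area (Continuous): lot size in square feet
Gr Liv Area (Continuous): above-ground living area in square feet
Total Bsmt SF (Continuous): total basement area in square feet
TotRmsAbvGrd (Discrete): total number of rooms above ground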
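The split into categorical and numeric variables was done by hand in a spreadsheet. As a rough cross-check, pandas can separate columns by dtype. The sketch below uses a small made-up frame (the `toy` name and all values are assumptions, not rows from the Ames files).

```python
import pandas as pd

# Toy frame standing in for the Ames training data (values are made up).
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "Gilbert", "NAmes"],
    "Central_Air": ["Y", "N", "Y"],
    "Lot_Area": [8450, 9600, 11250],
    "Gr_Liv_Area": [1710, 1262, 1786],
    "SalePrice": [208500, 181500, 223500],
})

# Columns stored as strings are treated as categorical candidates;
# everything with a numeric dtype is treated as numeric.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(len(categorical_cols), "categorical:", categorical_cols)
print(len(numeric_cols), "numeric:", numeric_cols)
```

A dtype-based split is only a first pass: an ordinal variable stored as integers, such as Overall_Qual, lands in the numeric group and has to be reclassified by hand.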

Analyzing the most important variable, "SalePrice"

training_data['SalePrice'].describe()

The descriptive statistics above show the mean, standard deviation, minimum, 25th, 50th, and 75th percentiles, and maximum of the sale price. SalePrice contains no invalid or non-numeric values.

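# Plot a histogram of "SalePrice" (sns.distplot is deprecated in recent seaborn releases)
sns.histplot(training_data['SalePrice'], kde=True)
# Compute the skewness and kurtosis
print("Skewness: %f" % training_data['SalePrice'].skew())
print("Kurtosis: %f" % training_data['SalePrice'].kurt())

The histogram shows that "SalePrice" is roughly bell-shaped but departs from a normal distribution: the skewness is 1.721408 and the kurtosis is 4.838055. The peak is sharper than that of a normal distribution, and the distribution is right-skewed, with a long tail on the right.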
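To make the skewness and kurtosis figures easier to read, the sketch below compares a symmetric sample with a right-skewed one. Both samples are synthetic (the distribution parameters are assumptions chosen for illustration), so the printed numbers are not the Ames values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples: one symmetric, one with a long right tail.
symmetric = pd.Series(rng.normal(loc=180_000, scale=40_000, size=5_000))
right_skewed = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=5_000))

for name, s in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    # Series.kurt() reports excess kurtosis, so a normal sample is near 0.
    print(f"{name}: skew={s.skew():.3f}, excess kurtosis={s.kurt():.3f}")
```

Because pandas reports excess kurtosis, any clearly positive value already means a sharper peak and heavier tails than a normal distribution.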

2. Categorical variables

(1) Utilities and SalePrice

Utilities (Ordinal): Type of utilities available

AllPub All public Utilities (E,G,W,& S)

NoSewr Electricity, Gas, and Water (Septic Tank)

NoSeWa Electricity and Gas Only

ELO Electricity only

# Categorical variables
# 1. Utilities
var = 'Utilities'
data = pd.concat([training_data['SalePrice'], training_data[var]], axis=1)
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)

The plot shows that houses with the full set of utilities (water, electricity, gas) generally sell for more.
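A boxplot reading like the one above can be backed up with numbers. The following sketch computes the per-category count and quartiles that a boxplot draws; the rows are made up and only the column names follow this post.

```python
import pandas as pd

# Made-up rows; the column names follow the ones used in this post.
toy = pd.DataFrame({
    "Utilities": ["AllPub", "AllPub", "AllPub", "NoSewr", "NoSewr", "NoSeWa"],
    "SalePrice": [210000, 185000, 320000, 140000, 120000, 137500],
})

# Count and quartiles per category: the same quantities a boxplot draws.
summary = (
    toy.groupby("Utilities")["SalePrice"]
    .describe()[["count", "25%", "50%", "75%"]]
    .sort_values("50%", ascending=False)
)
print(summary)
```

The count column matters as much as the medians: a category with only a handful of rows can look cheap or expensive purely by chance.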

(2) Heating and SalePrice

Heating (Nominal): Type of heating

Floor Floor Furnace

GasA Gas forced warm air furnace

GasW Gas hot water or steam heat

Grav Gravity furnace

OthW Hot water or steam heat other than gas

Wall Wall furnace

#2.Heating
var = 'Heating'
data = pd.concat([training_data['SalePrice'], training_data[var]], axis=1)
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)

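The plot shows that houses with GasA or GasW heating sell for more, and that prices for GasA houses vary widely. Higher-priced houses generally have GasA heating.

(3) Central_Air and SalePrice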

#3.Central_Air
var = 'Central_Air'
data = pd.concat([training_data['SalePrice'], training_data[var]], axis=1)
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)

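Houses with central air conditioning give occupants a better experience, so they generally sell for more; higher-priced houses generally have central air conditioning.

(4) Garage_Type and SalePrice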

Garage Type (Nominal): Garage location

2Types More than one type of garage

Attchd Attached to home

Basment Basement Garage

BuiltIn Built-In (Garage part of house - typically has room above garage)

CarPort Car Port

Detchd Detached from home

NA No Garage

#4.Garage_Type
var = 'Garage_Type'
data = pd.concat([training_data['SalePrice'], training_data[var]], axis=1)
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)

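The more convenient the garage, the higher the house price tends to be; houses with attached or built-in garages sell for the most.

(5) Neighborhood and SalePrice

Neighborhood is the specific area of Ames in which the house is located. Houses closer to busy downtown areas, scenic tourist areas, technology parks, and school districts are expected to be more expensive.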

#5.Neighborhood
fig, axs = plt.subplots(nrows=2)

sns.boxplot(
    x='Neighborhood',
    y='SalePrice',
    data=training_data.sort_values('Neighborhood'),
    ax=axs[0]
)

sns.countplot(
    x='Neighborhood',
    data=training_data.sort_values('Neighborhood'),
    ax=axs[1]
)

# Draw median price
axs[0].axhline(
    y=training_data['SalePrice'].median(),
    color='red',
    linestyle='dotted'
)

# Label the bars with counts
for patch in axs[1].patches:
    x = patch.get_bbox().get_points()[:, 0]
    y = patch.get_bbox().get_points()[1, 1]
    axs[1].annotate(f'{int(y)}', (x.mean(), y), ha='center', va='bottom')

# Format x-axes
axs[1].set_xticklabels(axs[1].xaxis.get_majorticklabels(), rotation=90)
axs[0].xaxis.set_visible(False)

# Narrow the gap between the plots
plt.subplots_adjust(hspace=0.01)

The plot shows that the Neighborhood column in our training data is unevenly distributed: NAmes has 299 records, whereas Blueste has only 4, Gilbert only 6, and GrnHill only 2, so conclusions about the sparsely represented neighborhoods are less reliable.
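One common way to handle such sparse categories, which this post does not itself apply, is to merge them into a single shared label. The sketch below uses made-up counts; the threshold `min_count` and the `Neighborhood_grouped` column are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Made-up neighborhood labels with deliberately uneven counts.
toy = pd.DataFrame({
    "Neighborhood": ["NAmes"] * 12 + ["CollgCr"] * 7 + ["Blueste"] * 2 + ["GrnHill"] * 1
})

min_count = 5  # hypothetical threshold, chosen only for this illustration

counts = toy["Neighborhood"].value_counts()
rare = counts[counts < min_count].index

# Replace rare labels with a shared "Other" label; keep the rest unchanged.
toy["Neighborhood_grouped"] = toy["Neighborhood"].where(
    ~toy["Neighborhood"].isin(rare), "Other"
)
print(toy["Neighborhood_grouped"].value_counts())
```

Whether merging is appropriate depends on the model; it trades detail about small neighborhoods for more stable estimates.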

(6) Overall Qual and SalePrice

The higher the overall rating, the higher the house price should be.

#Overall Qual
var = 'Overall_Qual'
data = pd.concat([training_data['SalePrice'], training_data[var]], axis=1)
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)

3. Numeric variables

(1) Lot Area and SalePrice

# Numeric variables
# 1. Lot Area
# (stat_func is omitted: recent seaborn releases no longer accept it in jointplot)
sns.jointplot(
    x='Lot_Area',
    y='SalePrice',
    data=training_data,
    kind="reg",
    ratio=4,
    space=0,
    scatter_kws={
        's': 3,
        'alpha': 0.25
    },
    line_kws={
        'color': 'black'
    }
)

There is no obvious trend; the points are concentrated at the low end of the Lot_Area range rather than spread out.

(2) Gr_Liv_Area and SalePrice

Gr_Liv_Area is the above-ground living area of the house.

The two are expected to be positively correlated: the larger the living area, the higher the price.

# (stat_func is omitted: recent seaborn releases no longer accept it in jointplot)
sns.jointplot(
    x='Gr_Liv_Area',
    y='SalePrice',
    data=training_data,
    kind="reg",
    ratio=4,
    space=0,
    scatter_kws={
        's': 3,
        'alpha': 0.25
    },
    line_kws={
        'color': 'black'
    }
)

Result: the two do show a positive linear relationship, but Gr_Liv_Area contains outliers above 5000.

Write a function to remove the Gr_Liv_Area outliers above 5000:

def remove_outliers(data, variable, lower=-np.inf, upper=np.inf):
    """
    Input:
      data (data frame): the table to be filtered
      variable (string): the column with numerical outliers
      lower (numeric): observations with values lower than this will be removed
      upper (numeric): observations with values higher than this will be removed

    Output:
      a data frame with the outlying rows removed
    """
    # Keep only rows strictly inside (lower, upper)
    data = data[(data[variable] > lower) & (data[variable] < upper)]
    return data

# Drop the Gr_Liv_Area outliers above 5000
training_data = remove_outliers(training_data, 'Gr_Liv_Area', upper=5000)

Plot again

After the outliers are removed, the two still show a positive linear relationship.
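The bound-based filtering described above can be tried on a few rows before running it on the real data. This is a minimal sketch: `keep_within` is a hypothetical helper name and the rows are made up.

```python
import numpy as np
import pandas as pd

def keep_within(data, variable, lower=-np.inf, upper=np.inf):
    """Return the rows of `data` whose `variable` lies strictly between the bounds."""
    mask = (data[variable] > lower) & (data[variable] < upper)
    return data[mask]

toy = pd.DataFrame({
    "Gr_Liv_Area": [1200, 1850, 2400, 5642, 4676],
    "SalePrice": [150000, 210000, 280000, 160000, 184750],
})

filtered = keep_within(toy, "Gr_Liv_Area", upper=5000)
print(f"rows before: {len(toy)}, rows after: {len(filtered)}")
```

The filter returns a new frame and leaves `toy` unchanged, so the result has to be assigned back for the removal to take effect.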

(3) Total_Bsmt_SF and SalePrice

#3.Total Bsmt SF
# (stat_func is omitted: recent seaborn releases no longer accept it in jointplot)
sns.jointplot(
    x='Total_Bsmt_SF',
    y='SalePrice',
    data=training_data,
    kind="reg",
    ratio=4,
    space=0,
    scatter_kws={
        's': 3,
        'alpha': 0.25
    },
    line_kws={
        'color': 'black'
    }
)

(4) TotRms_AbvGrd and SalePrice

#4.TotRmsAbvGrd
# (stat_func is omitted: recent seaborn releases no longer accept it in jointplot)
sns.jointplot(
    x='TotRms_AbvGrd',
    y='SalePrice',
    data=training_data,
    kind="reg",
    ratio=4,
    space=0,
    scatter_kws={
        's': 3,
        'alpha': 0.25
    },
    line_kws={
        'color': 'black'
    }
)

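4. Plotting the correlation matrix

# Plot the correlation matrix (numeric columns only; recent pandas needs this stated explicitly)
corrmat = training_data.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(40, 20))
sns.heatmap(corrmat, vmax=0.8, square=True, cmap="PiYG", center=0.0)

Among the numeric variables, Overall_Qual (overall quality rating), Year_Built (year built), Year_Remod/Add (year of remodeling or addition), Mas_Vnr_Area (masonry veneer area), Total_Bsmt_SF (total basement area), 1stFlr_SF (first-floor area), Gr_Liv_Area (above-ground living area), Garage_Cars (garage capacity in cars), and Garage_Area (garage area) are all positively correlated with SalePrice.

Because the variables within each of the following pairs overlap heavily, only one from each pair is kept: Year_Built is chosen from Year_Built and Year_Remod/Add, Gr_Liv_Area from 1stFlr_SF and Gr_Liv_Area, and Garage_Cars from Garage_Cars and Garage_Area.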
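The selection above relies on two readings of the correlation matrix: how strongly each feature tracks SalePrice, and how strongly paired features track each other. Both can be read off numerically. The data below are synthetic and built so that Garage_Area closely follows Garage_Cars; the coefficients are assumptions, not estimates from the Ames data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic features: Garage_Area is built to track Garage_Cars closely.
garage_cars = rng.integers(0, 4, size=n)
garage_area = 250 * garage_cars + rng.normal(0, 40, size=n)
gr_liv_area = rng.normal(1500, 400, size=n)
sale_price = 60 * gr_liv_area + 15_000 * garage_cars + rng.normal(0, 20_000, size=n)

toy = pd.DataFrame({
    "Garage_Cars": garage_cars,
    "Garage_Area": garage_area,
    "Gr_Liv_Area": gr_liv_area,
    "SalePrice": sale_price,
})

corr = toy.corr(numeric_only=True)

# Correlation of each feature with the target, strongest first.
target_corr = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(target_corr)

# A high pairwise correlation suggests the two features carry similar information.
print("Garage_Cars vs Garage_Area:", round(corr.loc["Garage_Cars", "Garage_Area"], 3))
```

When two features are this strongly correlated with each other, keeping both adds little information, which is the reasoning behind keeping one variable per pair.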

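6. Fitting the model

sklearn offers many regression methods. Generalized linear regression lives in the linear_model module, for example ordinary linear regression, Lasso, and ridge regression. There are also other, non-linear regression methods, such as kernel SVM, ensemble methods, Bayesian regression, k-nearest-neighbors regression, decision tree regression, and random forest regression. The candidates can be compared by testing each algorithm.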
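One way to compare candidate regression methods by testing them is K-fold cross-validation. The sketch below is written with numpy only so that it stays self-contained: a mean-only baseline and ordinary least squares stand in for the sklearn estimators listed above, the data are synthetic, and every function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Synthetic data: price depends linearly on two features plus noise.
X = np.column_stack([rng.normal(1500, 400, n), rng.integers(1, 11, n)])
y = 50 * X[:, 0] + 12_000 * X[:, 1] + rng.normal(0, 15_000, n)

def fit_predict_mean(X_tr, y_tr, X_te):
    """Baseline: always predict the training mean."""
    return np.full(len(X_te), y_tr.mean())

def fit_predict_ols(X_tr, y_tr, X_te):
    """Ordinary least squares with an intercept, solved by numpy."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ coef

def kfold_rmse(fit_predict, X, y, k=5, seed=0):
    """Average held-out RMSE over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    return float(np.mean(scores))

rmse_mean = kfold_rmse(fit_predict_mean, X, y)
rmse_ols = kfold_rmse(fit_predict_ols, X, y)
print(f"baseline RMSE: {rmse_mean:.0f}, OLS RMSE: {rmse_ols:.0f}")
```

Each model is scored only on rows it was not fitted on, so the comparison reflects prediction error rather than how well the model memorizes the training rows.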

(1) Load the required packages

# Fit the data
from sklearn import preprocessing
from sklearn import linear_model, svm, gaussian_process
from sklearn.ensemble import RandomForestRegressor
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation has been removed)
from sklearn.model_selection import train_test_split
import numpy as np

(2) Check each column for missing values

# Check each column for missing values
print(training_data.Overall_Qual.isnull().any())
print(training_data.Gr_Liv_Area.isnull().any())
print(training_data.Garage_Cars.isnull().any())
print(training_data.Total_Bsmt_SF.isnull().any())
print(training_data.Year_Built.isnull().any())
print(training_data.Mas_Vnr_Area.isnull().any())

Total_Bsmt_SF and Mas_Vnr_Area turn out to have missing values.

# Fill the missing values with the column mean
training_data.Total_Bsmt_SF = training_data.Total_Bsmt_SF.fillna(training_data.Total_Bsmt_SF.mean())
training_data.Mas_Vnr_Area = training_data.Mas_Vnr_Area.fillna(training_data.Mas_Vnr_Area.mean())
print(training_data.Total_Bsmt_SF.isnull().any())
print(training_data.Mas_Vnr_Area.isnull().any())
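The per-column checks above can also be done for all selected columns at once. The sketch below counts missing entries per column and fills each column with its own mean; the frame and its values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Overall_Qual": [5, 7, 6, 8],
    "Total_Bsmt_SF": [856.0, np.nan, 920.0, 1100.0],
    "Mas_Vnr_Area": [196.0, 0.0, np.nan, np.nan],
})

cols = ["Overall_Qual", "Total_Bsmt_SF", "Mas_Vnr_Area"]

# Number of missing entries per column, instead of one print per column.
missing_before = toy[cols].isnull().sum()
print(missing_before)

# fillna with a Series of column means fills each column with its own mean.
filled = toy[cols].fillna(toy[cols].mean())
print(filled.isnull().sum())
```

Counting with `isnull().sum()` also shows how many values are missing, which `isnull().any()` does not.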

(3) Fit the model

# Prepare the data
from sklearn import metrics
cols = ['Overall_Qual','Gr_Liv_Area', 'Garage_Cars','Total_Bsmt_SF', 'Year_Built','Mas_Vnr_Area']
x = training_data[cols].values
y = training_data['SalePrice'].values
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

clf = RandomForestRegressor(n_estimators=400)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Compute the MSE
print(metrics.mean_squared_error(y_test, y_pred))
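The MSE printed above is in squared dollars, which is hard to interpret. Taking the square root gives the RMSE, which is in the same units as the price. The arrays below are made-up values, not model output.

```python
import numpy as np

# Made-up actual and predicted prices, in dollars.
y_true = np.array([200_000, 150_000, 320_000, 180_000], dtype=float)
y_hat = np.array([210_000, 140_000, 300_000, 185_000], dtype=float)

mse = np.mean((y_true - y_hat) ** 2)   # units: dollars squared
rmse = np.sqrt(mse)                    # units: dollars

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

Here the errors are 10,000, 10,000, 20,000, and 5,000 dollars in magnitude, and the RMSE comes out to 12,500 dollars, a figure that can be compared directly with typical house prices.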

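(4) Scatter plot of the predictions

import numpy as np
# Random x positions, one per held-out observation, used only to spread the points out
x = np.random.rand(len(y_test))
plt.scatter(x, y_test, alpha=0.5)
plt.scatter(x, y_pred, alpha=0.5, color="g")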

(5) Load the test data

test_data=pd.read_csv("ames_test.csv")
test_data.head(5)

Check for missing values

# Check each column for missing values
print(test_data.Overall_Qual.isnull().any())
print(test_data.Gr_Liv_Area.isnull().any())
print(test_data.Garage_Cars.isnull().any())
print(test_data.Total_Bsmt_SF.isnull().any())
print(test_data.Year_Built.isnull().any())
print(test_data.Mas_Vnr_Area.isnull().any())

# Fill the missing values in the test set's own Garage_Cars column with the training-set mean
test_data.Garage_Cars = test_data.Garage_Cars.fillna(training_data.Garage_Cars.mean())
print(test_data.Garage_Cars.isnull().any())
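When filling gaps in the test set, the fill value and the column being filled are easy to mix up. The sketch below separates the two steps: the mean is computed from training rows, and `fillna` is applied to the test column. The frames are made up.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"Garage_Cars": [1.0, 2.0, 2.0, 3.0]})
test = pd.DataFrame({"Garage_Cars": [2.0, np.nan, 1.0]})

# The fill value is learned from the training rows only...
train_mean = train["Garage_Cars"].mean()

# ...and applied to the test column itself, so the test rows keep their own values.
test["Garage_Cars"] = test["Garage_Cars"].fillna(train_mean)
print(test)
```

Assigning a training column to the test frame instead would overwrite the test values with training values matched by row index, which is why the column on the right-hand side has to be the test column.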

(6) Predict house prices for the test set

# Predict
cols = ['Overall_Qual','Gr_Liv_Area', 'Garage_Cars','Total_Bsmt_SF', 'Year_Built','Mas_Vnr_Area']
x_test_value = test_data[cols].values
test_pre = clf.predict(x_test_value)
# Write the predictions to a file
prediction = pd.DataFrame(test_pre, columns=['SalePrice'])
result = pd.concat([test_data['Id'], prediction], axis=1)
result.to_csv('./Predictions.csv', index=False)
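The output file pairs each Id with its predicted price. The sketch below shows an alternative way to assemble that table, assuming made-up ids and predictions, and writes to an in-memory buffer rather than to disk.

```python
import io
import numpy as np
import pandas as pd

# Made-up ids and predictions; in the post these come from test_data and clf.predict.
test_ids = pd.Series([101, 205, 309], name="Id")
test_pre = np.array([181234.5, 142000.0, 305750.25])

# Build the frame from plain arrays so no index alignment is involved.
result = pd.DataFrame({"Id": test_ids.to_numpy(), "SalePrice": test_pre})

# Written to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
result.to_csv(buffer, index=False)
print(buffer.getvalue())
```

pd.concat with axis=1 aligns rows by index, so it produces NaN entries if the Id column does not carry the default 0..n-1 index; building the frame from arrays avoids that.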

