
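Quantitative Methods in R: Regression, Dummy Variables and Interaction Terms, Hypothesis Testing (F-Test, AIC and BIC) for Analyzing Student Test Score Data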

2022-07-24 11:26 Author: 拓端tecdat

Full text link: http://tecdat.cn/?p=27578

Original source: the 拓端數據部落 (tecdat) WeChat official account

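Regression assumptions

Omitted variable bias

If the true model includes X1 and X2 but we leave out X2, then, under certain conditions, the estimate of the coefficient on X1 will be biased. Omitted variable bias (OVB) requires both that the omitted variable is correlated with the included one, cor(X1, X2) != 0, and that the omitted variable is related to the outcome, cor(X2, y) != 0.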
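The two conditions for omitted variable bias can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names, coefficients and sample size are assumptions chosen for illustration, not values from the school data.

```r
# Simulated data (hypothetical): x2 affects y AND is correlated with x1
set.seed(1)
n  <- 5000
x2 <- rnorm(n)
x1 <- 0.7 * x2 + rnorm(n)               # cor(x1, x2) != 0
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)    # true effect of x1 is 2

full  <- lm(y ~ x1 + x2)   # correctly specified model
short <- lm(y ~ x1)        # x2 omitted

b_full  <- unname(coef(full)["x1"])
b_short <- unname(coef(short)["x1"])

# In-sample OVB identity: short = full + (coef on x2) * (slope of x2 on x1)
delta    <- unname(coef(lm(x2 ~ x1))["x1"])
b_implied <- b_full + unname(coef(full)["x2"]) * delta

print(c(full = b_full, short = b_short, implied = b_implied))
```

The coefficient from the short regression absorbs part of the effect of x2, and the size of that distortion is the product of the two links named in the conditions.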

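Homoskedasticity

To make valid inferences we assume that the error variance is constant. If it is not, we risk drawing wrong inferences. Heteroskedasticity does not bias the coefficients; it only affects the standard errors (SE). Remedy: robust SEs.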
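As a rough sketch of the remedy, heteroskedasticity-robust standard errors (the HC1 variant) can be computed by hand in base R. The data are simulated and all names are assumptions; in practice a dedicated package such as `sandwich` is the more common route.

```r
# Simulated data (hypothetical) where the error spread grows with x
set.seed(2)
n <- 2000
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)
X   <- model.matrix(fit)      # n x k design matrix
e   <- resid(fit)

bread  <- solve(crossprod(X))             # (X'X)^-1
meat   <- crossprod(X * e)                # X' diag(e^2) X
vc_hc1 <- bread %*% meat %*% bread * n / (n - ncol(X))  # HC1 correction

se_classic <- sqrt(diag(vcov(fit)))   # assumes constant error variance
se_robust  <- sqrt(diag(vc_hc1))      # valid under heteroskedasticity

print(rbind(classic = se_classic, robust = se_robust))
```

The coefficient estimates are identical in both cases; only the standard errors, and therefore the t-statistics and p-values, change.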

Endogeneity

If X affects Y but Y also affects X, we have endogeneity, which leads to a biased estimator.

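Dummy variables and interactions

Dummy variables

A variable that can take two values, for example class size (small class, large class). Also called an indicator variable or binary variable.

What happens when we estimate this model?

$\text{score}_i = \beta_0 + \beta_1 \text{large}_i + \varepsilon_i$

$y_i = \beta_0 + \beta_1 d_i + \varepsilon_i$

What is the estimate for small classes?

What is the estimate for large classes?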

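Example: school data

What is the expected score for small classes?
• $\hat\beta_0$

What is the expected score for large classes?
• $\hat\beta_0 + \hat\beta_1$

What is the expected difference between small and large classes?
• $\hat\beta_1$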

> summary(mol.mll)
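The interpretation of the two coefficients can be checked directly: in a regression on a single dummy, the intercept equals the mean of the d = 0 group and the slope equals the difference in group means. The sketch below uses simulated scores; the numbers and variable names are assumptions, not the school data.

```r
# Simulated test scores (hypothetical): 100 small and 100 large classes
set.seed(3)
big   <- rep(c(0, 1), each = 100)              # dummy: 1 = large class
score <- 660 - 8 * big + rnorm(200, sd = 15)

fit <- lm(score ~ big)
b   <- unname(coef(fit))       # b[1] = intercept, b[2] = dummy coefficient

mean_small <- mean(score[big == 0])
mean_big   <- mean(score[big == 1])

print(c(b0 = b[1], b0_plus_b1 = b[1] + b[2],
        mean_small = mean_small, mean_big = mean_big))
```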

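Dummy variables and regression

What happens when we add a dummy variable to a model with a continuous explanatory variable?

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

$y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \varepsilon_i$

With d = 1 for large classes and d = 0 for small classes, we get for large classes:

$y_i = (\beta_0 + \beta_2) + \beta_1 x_i + \varepsilon_i$

For small classes we get:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$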

Illustration

School data

> model2 <- lm(testscr ~ STratio + big.school, data = dt1)
> summary(model2)

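What is the marginal effect of one more student per teacher?

• $\hat\beta_{STratio}$

What is the effect of a large class?

• $\hat\beta_{big.school}$

Is the effect of STratio the same for small and large classes?

• Yes, $\hat\beta_{STratio}$ is the same for every district (parallel lines)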
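The "parallel lines" property can be verified numerically: without an interaction term, the predicted gap between the two groups is the same at every value of the continuous variable. A minimal sketch on simulated data follows; the variable names `stratio`, `big` and `score` and all coefficients are assumptions.

```r
# Simulated data (hypothetical): continuous regressor plus a dummy
set.seed(4)
n       <- 300
stratio <- runif(n, 14, 26)
big     <- rbinom(n, 1, 0.5)
score   <- 700 - 2 * stratio - 5 * big + rnorm(n, sd = 8)

fit <- lm(score ~ stratio + big)

# Predict both groups at several student-teacher ratios
grid       <- c(15, 20, 25)
pred_small <- predict(fit, data.frame(stratio = grid, big = 0))
pred_big   <- predict(fit, data.frame(stratio = grid, big = 1))

gap <- unname(pred_big - pred_small)   # vertical distance between the lines
print(gap)
print(coef(fit)["big"])
```

The gap equals the dummy coefficient at every grid point, which is exactly what a shifted intercept with a shared slope implies.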

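Adding a dummy variable can change everything

Interaction terms

Regression model

In a multiple regression model, $\hat\beta_1$ describes the marginal effect of X1 while controlling for the effect of X2. The built-in assumption is that X1 has the same effect for all observations.

Interactions

One way to relax this assumption is to allow the effect to vary.

We do this with an interaction: we add the product of the explanatory variables to the model:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} \cdot X_{2i} + \varepsilon_i$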

Figure 1

Figure 2

Figure 3

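Interactions: dummy variables and regression

Why assume that the effect ($\beta_1$) is constant across all subgroups?
Let us allow STratio to have a different effect depending on big.school:

$y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \beta_3 d_i \cdot x_i + \varepsilon_i$

With d = 1 for large classes and d = 0 for small classes, we get for large classes:

$y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_i + \varepsilon_i$

For small classes:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$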
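A short sketch of the interaction model on simulated data shows how the group-specific slopes are recovered from the coefficients. All names and numbers here are assumptions for illustration; the fully interacted model reproduces the slopes of two separate group regressions.

```r
# Simulated data (hypothetical): slope of x differs between the groups
set.seed(5)
n <- 400
x <- runif(n, 14, 26)
d <- rbinom(n, 1, 0.5)                      # 1 = large class
y <- 700 - 1 * x + 10 * d - 1.5 * d * x + rnorm(n, sd = 8)

fit <- lm(y ~ x * d)       # expands to x + d + x:d
b   <- coef(fit)

slope_small <- unname(b["x"])               # beta1
slope_big   <- unname(b["x"] + b["x:d"])    # beta1 + beta3

# Same slopes from separate regressions within each group
sep_small <- unname(coef(lm(y ~ x, subset = d == 0))["x"])
sep_big   <- unname(coef(lm(y ~ x, subset = d == 1))["x"])

print(c(slope_small = slope_small, slope_big = slope_big))
```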

> screenreg(list(model1, model2, model3))

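STratio & large class

What dummy variables can do

Qualitative information

We can include qualitative information (nominal variables) in a regression model.

Allowing flexible models

We can use interactions with dummy variables to allow the marginal effect of X to differ across subgroups.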

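Interactions: x and x²

Why add x²?

Sometimes we expect the effect of x on y to be non-linear.

If all $x_i > 0$:

• If $\beta_x$ is positive and $\beta_{x^2}$ is negative, you get an inverted U shape
• If $\beta_x$ is negative and $\beta_{x^2}$ is positive, you get a U shape
• If both are positive (or both negative), the curve rises (or falls) monotonically and becomes steeper as x grows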
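A quadratic term is added in R with `I(x^2)`. The sketch below simulates an inverted U and locates its turning point at $-\beta_x / (2\beta_{x^2})$; the data and coefficient values are assumptions for illustration.

```r
# Simulated inverted-U relationship (hypothetical); true peak at x = 6
set.seed(6)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x - 0.25 * x^2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x + I(x^2))   # I() protects ^ inside a formula
b   <- coef(fit)

# Setting the derivative b_x + 2 * b_x2 * x to zero gives the turning point
turning_point <- unname(-b["x"] / (2 * b["I(x^2)"]))

print(b)
print(turning_point)
```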

Let's look at the large classes

> screenreg(list(model4, model5))

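What is the effect of X?

Because the model includes an X² term, a one-unit increase in X no longer has a constant effect. The change in the expected outcome when X moves from one specific value to another is called the first difference.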
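To see why the first difference depends on the starting value, the sketch below computes it from point predictions at x = 14 and x = 21. Note that this uses plain `predict()` on simulated data rather than a simulation-based approach, and the helper `first_diff` is a hypothetical name introduced only for this example.

```r
# Simulated data (hypothetical) with a quadratic effect of x
set.seed(7)
n <- 400
x <- runif(n, 14, 26)
y <- 500 + 20 * x - 0.6 * x^2 + rnorm(n, sd = 8)

fit <- lm(y ~ x + I(x^2))
b   <- coef(fit)

# First difference: change in the prediction when x goes from `at` to `at + 1`
first_diff <- function(model, at) {
  p <- predict(model, data.frame(x = c(at, at + 1)))
  unname(p[2] - p[1])
}

fd14 <- first_diff(fit, 14)
fd21 <- first_diff(fit, 21)

# Algebraically: b_x + b_x2 * ((at + 1)^2 - at^2) = b_x + b_x2 * (2 * at + 1)
fd14_formula <- unname(b["x"] + b["I(x^2)"] * (2 * 14 + 1))

print(c(fd14 = fd14, fd21 = fd21))
```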

At x = 14

> s.out <- sim(z.out, x = x.low, x1 = x.high)
> summary(s.out)

Now x = 21

> x.low <- setx(z.out, STratio = 21)
> x.high <- setx(z.out, STratio = 22)

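Consequences of interactions

• The direction of the effect (positive or negative) is hard to judge directly from the output
• The marginal effect is not constant: any statement about the effect depends on the value of x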

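Joint hypothesis tests

Testing models against each other

Distinguishing between models

We have several tools for assessing whether model A or model B fits the data better.
• F-test: tells you whether the larger model fits the data significantly better than the model nested in it.
• AIC: tells you which model fits better after a penalty for the number of parameters; often leads to conclusions similar to the F-test
• BIC: similar to AIC, but penalizes complexity more heavily
Note: AIC and BIC are special in that the two models do not need to be nested in each other. It is enough to estimate them on the same sample.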

F-test: does school size matter?

> screenreg(list(model6, model7, model8), stars = c(0.01, 0.05, 0.1))

> anova
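A minimal sketch of a joint F-test for nested models: `anova()` compares a restricted model with an unrestricted one, and the same statistic can be computed by hand from the residual sums of squares. The data are simulated and the model names `restricted` and `unrestricted` are assumptions, not the article's model6 to model8.

```r
# Simulated data (hypothetical): does the dummy (and its interaction) matter?
set.seed(8)
n <- 300
x <- runif(n, 14, 26)
d <- rbinom(n, 1, 0.5)
y <- 700 - 2 * x - 6 * d + rnorm(n, sd = 8)

restricted   <- lm(y ~ x)        # H0: coefficients on d and x:d are both 0
unrestricted <- lm(y ~ x * d)

ftest <- anova(restricted, unrestricted)

# Manual F statistic: ((RSS_r - RSS_u) / q) / (RSS_u / df_u)
rss_r <- sum(resid(restricted)^2)
rss_u <- sum(resid(unrestricted)^2)
q     <- 2                                  # number of restrictions
df_u  <- df.residual(unrestricted)
f_manual <- ((rss_r - rss_u) / q) / (rss_u / df_u)
p_manual <- pf(f_manual, q, df_u, lower.tail = FALSE)

print(ftest)
```

A small p-value rejects the restricted (more parsimonious) model in favor of the larger one.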

The F-test in OLS output

Model 1:

Model 2:

> summary(model6)

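Testing models against each other

Why penalize complexity?

The more complex a model is, the better it fits the data.
More complex also means less parsimonious. Because of the danger of overfitting the data, we need parsimonious models for prediction.

Theory testing versus prediction

When we test a theory by testing hypotheses, we use the F-test.
When making predictions, we prefer BIC for distinguishing between models.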

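AIC and BIC

Akaike information criterion and Bayesian information criterion
AIC and BIC both reflect how well the model fits the data and how many explanatory variables it includes (complexity). The measure goes down when the fit improves and goes up when more variables are added. Lower values are better.
• When trying to distinguish between two models, BIC and AIC may give conflicting advice
• If you are testing a theory, use the F-test if the models are nested, otherwise use AIC
• If you are making predictions, use AIC or BIC
• BIC is the more conservative measure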
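The trade-off between fit and complexity can be made concrete by rebuilding both criteria from the log-likelihood: AIC = -2 logLik + 2k and BIC = -2 logLik + log(n) k, where k counts the estimated parameters. The sketch below uses simulated data, and the model names are assumptions.

```r
# Simulated data (hypothetical); z is an irrelevant regressor
set.seed(9)
n <- 200
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

m_small <- lm(y ~ x)
m_big   <- lm(y ~ x + z)

# Rebuild AIC and BIC from the log-likelihood of the larger model
ll <- logLik(m_big)
k  <- attr(ll, "df")          # coefficients plus the error variance
aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(nobs(m_big)) * k

print(AIC(m_small, m_big))    # lower is better
print(BIC(m_small, m_big))
```

Because log(n) exceeds 2 once n is at least 8, BIC charges more per parameter than AIC, which is why it is the more conservative of the two.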

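F-test: example

• Remember: the F-test rejected the more parsimonious model
• AIC barely favors the more complex model (3059.089 > 3057.248)
• BIC rejects the more complex model

> screenreg(list(model6, model8), stars = c(0.01, 0.05, 0.1))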

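F-test, AIC and BIC

• When using the F-test:

  • one model must be nested in the other

  • both models must be estimated on the same observations

• AIC and BIC can handle non-nested models

• AIC and BIC are valid only when both models are estimated on the same sample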

Self-test questions

A politician argues that immigrant children (defined as children for whom English is not their first language) cause entire districts to perform badly on standardized tests. Use the California school data to verify if there is any relationship between share of immigrant children and average test score.

There is a measure for how CME or LME a country is (market index = 0 is fully CME, 1 is fully LME). The theory predicts that being close to either ideal type should cause lower unemployment rates due to consistent institutional arrangements. Create a model to test this statement and include at least three additional explanatory variables. Interpret the results statistically and substantively.

Estimate a model explaining why somebody self-identifies as left or right on a 1-10 scale. Use at least two variables and in addition age. Is age a significant factor? Interpret st & su. In a second step, control for party affiliation and verify whether the results from the original model change or still hold

Datasets
1) SwissData2011.dta
Post-referendum survey among people living in Switzerland. The following list of variables is your codebook:
• VoteYes: Is 1 if someone voted yes and is 0 if someone voted no.
• male: Is 1 for men and 0 for women.
• age: Age in years.
• LeftRight: Left-Right self placement where low values indicate that a respondent is more to the left.
• GovTrust: Variable indicates a respondent's trust in government. Little or no trust is -1, neither YES nor NO is 0, and +1 if somebody trusts the government.
• ReligFreq: How frequently does a respondent attend a religious service? Never (0), only on special occasions (1), several times a year (2), once a month (3), and once a week (4).
• university: Binary indicator (dummy) whether respondent has a university degree (1) or not (0).
• party: Indicates which party a respondent supports. Liberals (1), Christian Democrats (2), Social Democrats (3), Conservative Right (4), and Greens (5).
• income: Income measured in ten different brackets (higher values indicate higher income). You may assume that this variable is interval-scaled.
• german: Binary indicator (dummy) whether respondent's mother tongue is German (1) or not (0).
• suburb: Binary indicator (dummy) whether respondent lives in a suburban neighborhood (1) or not (0).
• urban: Binary indicator (dummy) whether respondent lives in a city (1) or not (0).
• cars: Number of cars the respondent's household owns.
• old voter: Variable indicating whether a respondent is older than 60 years (1) or not (0).
• cantonnr: Variable indicating in which of the 26 Swiss cantons a respondent lives.
• nodenomination: Share of citizens in a canton which do not have a denomination.
• urbanization: Share of citizens in a canton which live in urban areas.

2) "violence data v2.dta"
Cross-national data set covering the years from 1980 to 1997.
• code: Country code, alpha, 3-digit
• country: Country name
• africa: Dummy variable for Sub-Saharan African countries (according to World Bank definition).
• agovdem80: Anti-government demonstrations: Any peaceful public gathering of at least 100 people.
• assass80: Assassinations: Number of assassinations per thousand population, decade average.
• blck80: Black Market Premium: Log of 1 + black market premium, decade average.
• cabchg80: Major Cabinet Changes: The number of times in a year that a new premier is named.
• compolt80: Dummy = 1 for countries with genocidal incident involving political victims.
• constchg80: Major Constitutional Changes: The number of basic alterations in a state's constitution.
• corrupti: Knack and Keefer measure of corruption (1980-89)
• coups80: Coups d'Etat: The number of extraconstitutional or forced changes in the top government.
• deathsPC80: Deaths from Political Violence per One Million Citizens
• democ80: Measure of democracy (Gastil's Political Rights)
• elf60: Index of ethnolinguistic fractionalization, 1960. Measures probability that two randomly drawn individuals do not come from the same ethno-linguistic group.
• govtcris80: Major Government Crises: Any rapidly developing situation that threatens to bring government down.
• gunn1: Gunnemark1: Percent of population not speaking the official language.
• gunn2: Gunnemark2: Percent of population not speaking the most widely used language.
• gyp80: Growth rate of real per capita GDP.
• latinca: Dummy variable for Latin America and the Caribbean.
• lly80: Financial Depth: Ratio of liquid liabilities of the financial system to GDP, dec
• lrgdp80: Log of initial Income: Log of real per capita GDP measured at the start of each decade.
• lrgdpsq80: Log of initial (Income squared): Log of Initial real per capita GDP squared.
• lschool80: Log of Schooling: Log of 1 + average years of school attainment.
• ltelpw80: Log of Telephones per worker: Log of telephones per 1000 workers
• muller: Probability of two randomly selected individuals speaking different languages.
• newspaperPC80: Newspapers per capita
• pavroad80: Paved Roads (percent of total)
• pop80: Country Population
• purges80: Purges: Any systematic elimination by jailing or execution of political opposition.
• racialt: Racial tension for 1984, 1 (low tension) to 6 (high tension)
• radiosPC80: Radios per thousand population
• revols80: Revolutions: Any illegal or forced change in the top governmental elite.
• riots80: Riots: Any violent demonstration or clash of more than 100 citizens.
• roberts: Probability that two randomly selected individuals do not speak the same language.
• rulelaw80: Rule of Law: Law and order tradition (0-6 scale)
• surp80: Fiscal Surplus/GDP: Decade average of ratio of central government surplus (+) to expenditure (-).
• tvPC80: Televisions per thousand population
• war80: Dummy for war on national territory during the decade
• warciv80: Dummy for civil war
• wdiinfmt80: Infant Mortality Rate: Number of deaths of infants under one year old per 1000 births.
• wdilabag80: Labor force in farming/forestry/hunting/fishing as a percent of total labor force.
• wditfert80: Fertility: The average number of children born alive to a woman in her lifetime.

3) WDIdata.csv
Variable descriptions and definitions are available in the WDIdata Description.csv file. This file is available together with the exam datasets.

Question 1
This question consists of three parts (a, b, c) and you must answer all three parts. Use the dataset SwissData2011.dta and answer all parts of the question. You are hired as a political consultant by a Swiss political party and they ask you to write a report on left-right self-identification in the Swiss electorate.
a) Provide mean and median for age and left-right self-identification, respectively.
b) Using a t-test you should answer the following question: Are German-speaking voters more to the right than non-German speaking voters?
c) Using a t-test you should answer the following question: Are older people more to the right than younger people? Note: use variable old voter.

Question 2
This question consists of three parts (a, b, c) and you must answer all three parts. Use the dataset "violence data v2.dta" and answer all parts of the question. A widely acknowledged measure for the extent of poverty is the child mortality rate (wdiinfmt80). Your task is to build a statistical model explaining child mortality rates.
a) Find at least six theoretically important predictors in the supplied dataset and explain why you would expect them to have an effect on wdiinfmt80. Estimate a linear regression model and present your findings. Interpret your findings substantively and statistically. You should also discuss the model quality. (Note: Do not use latinca or africa in your model.)
b) Add two dummy variables for Africa and Latin America (africa, latinca) to your model from part (a) in Question 2. Does the inclusion of these two variables change any of your findings? Explain how one can statistically test whether these two variables should be included in the model. Implement your suggested test, present and interpret the test results. Explain how the inclusion of africa and latinca may help you deal with the omitted variable bias problem.
c) Civil wars unleash a number of negative consequences and one of them can be wide-spread poverty. In this third part you will analyze why wars occur. You should develop a logit model to explain why a country experienced a civil war (warciv80). Estimate the model and include at least four reasonable explanatory variables (no need to justify your choice, but make sure that one variable is continuous). Assess the quality of your model by looking at the correct predictions of civil wars from your model. Provide substantive and statistical interpretations of the effects and provide at least one figure to illustrate the key insight of your model.

Question 3
This question consists of four parts (a, b, c, d) and you must answer all four parts. Natural resource wealth is often associated with poor socio-economic outcomes. This is the so-called "resource curse" hypothesis that is most often associated with a working paper by Sachs & Warner (1995) which was further developed in their 2001 article: "Natural Resources and Economic Development: The curse of natural resources," European Economic Review 45: 827-838. Three main elements in the "curse" are negative effects of natural resource wealth on (1) economic growth, (2) risk of conflict, and (3) political regime type. The original research paper relied on a cross-sectional design. You've been hired by a think tank to produce a report on the empirical evidence for the resource curse hypothesis using panel data. Furthermore, the original article used a general concept of natural resources. It can be argued that natural resources differ in their impact on economic growth. For example, the effect of oil abundance may be more pronounced than the effect of natural gas abundance. In extension of the original Sachs & Warner argument your contract with the think tank states that you need to empirically assess whether the resource curse is detrimental to economic growth and whether various types of natural resources have a differential effect on economic growth.
For the exercise we prepared a sample of data from World Bank Development Indicators (WDI) WDIdata.csv. Please note that WDI variables may differ from variables used in the original study. Instead you should choose variables for your model conceptually based on your interpretation of the Sachs & Warner theory and any other readings you may have encountered. For the natural resource wealth, WDI provides data on rents received by the state from specific natural resources as a percentage of GDP. You may find variable description in the WDIdata Description.csv file.
Please note that some variables may have unequal coverage across countries and years as some of the data may be missing. That means that depending on your choice of the variables you may end up with a smaller effective estimation sample size than expected.
In your analysis you need to address the following questions:
a) Estimate the effect of the resource curse on GDP growth rates for the full panel of countries and years. You should estimate individual, time, and twoway fixed effects models. Explain how your results change across these three models, and why that may be the case.
You should estimate your model using the relevant R packages as discussed in class and present model results in a minimally formatted table as specified in the exam description.
b) Explain how you would address the issue of serial correlation in your data. Implement both approaches that we covered in class, heteroskedasticity and autocorrelation consistent (HAC) standard errors and a lagged dependent variable (LDV), and discuss your findings. For the model with lagged dependent variable (LDV) also calculate immediate and long-term effects of natural resources on GDP growth.
You should estimate your model using the relevant R packages as discussed in class and present model results in a minimally formatted table as specified in the exam description. In addition, you need to show your calculations of the immediate (short-term) and long-term effects of natural resources on GDP growth.
c) Explain how you would address the issue of cross-sectional dependence. Implement the Driscoll-Kraay estimator (SCC estimator) to address cross-sectional dependence and explain your results. Please note that depending on the sparsity of your data (how much of the data is missing) the test for cross-sectional dependence may not have the expected performance.
