Date post: | 03-Dec-2023 |
Application of European Union Agriculture Policy
Estimation of Agriculture Wages
Asmina Simota M2015364
Kristen Scott M20015382
Mara Reis M2014158
I. Introduction
In recent decades the European Union has recognized the need to support agricultural wages.
This sector is important for two main reasons. First, it produces the goods necessary for
survival; second, from a historical view, it is the first sector of human activity. The
purpose of this project is to identify the attributes that most affect agricultural income
and to develop some policy proposals. To make the problem easier to follow, we include a
section that briefly explains the special characteristics of the sector and the European
Union’s policy response.
A.1 Special characteristics of the agriculture sector
Although agricultural production keeps increasing, unemployment in the sector is rising as
well. Moreover, the sector’s share of total production appears to be decreasing. For
example, the tables below are taken from the 2013 Eurostat annual report:
In the first table we can see that the percentage of people working in this sector has
decreased over these years.*
The second table, which is part of an Excel file, makes clear the declining share of the
agriculture sector in total GDP.**
As mentioned, economists have already identified the problem, and they highlight the
following characteristics:
Inelastic demand in price terms
This means that as output increases, prices must fall by a larger percentage for the
market to absorb the added product. As a result, the average income of the sector
decreases. For instance, machinery and technological development, which push up
production, ultimately reduce incomes.
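The mechanism can be illustrated with a small calculation. This is only a sketch: it assumes a constant price elasticity of demand, and both the elasticity of -0.5 and the 10% output increase are hypothetical numbers, not estimates from this project.

```python
def revenue_change_pct(quantity_change_pct, price_elasticity):
    """Approximate % change in revenue (price x quantity) after an output change.

    Assumption: constant price elasticity e (negative), so the price must
    move by roughly %dQ / e for the market to absorb the extra quantity.
    """
    dq = quantity_change_pct / 100.0
    dp = dq / price_elasticity          # inelastic demand (|e| < 1) => |dp| > |dq|
    return ((1 + dp) * (1 + dq) - 1) * 100.0


# Hypothetical example: output rises 10%, elasticity is -0.5 (inelastic).
# The price falls by 20%, so revenue falls even though more is sold.
print(round(revenue_change_pct(10, -0.5), 1))
```

With an elastic demand (for example an elasticity of -2) the same function returns a positive value, which is why the problem is specific to inelastic markets.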
Inelastic demand in terms of income
This means that as household income increases, families tend to spend a smaller
percentage of their money on agricultural products. The more developed the country, the
bigger this effect.
The increasing demand for food marketing services
As countries develop, there is a growing preference for products that have gone through
more pre-processing, such as semi-cooked meals, frozen fruit and salads. As a result,
the secondary sector gets a boost.
Land
The total supply of land is limited. First, the growing need for built-up areas reduces the
amount of available land: parks, streets, hospitals, schools and apartments are only some
of the categories of expanding built-up land. Second, because all of these processes are
time consuming, the total supply of land has low elasticity.
Employment
Agricultural production belongs mainly to family units, which makes it hard to identify
workers with the required skills. In most cases children work or help in the unit without
a salary or for a small one.
Another important issue is low mobility within agricultural activity. In simple terms,
someone who owns a farm finds it very difficult to relocate, because doing so means
moving the whole farm.
Capital (landed, exploitation, financial)
Landed: irrigation systems and available land.
Exploitation: tractors, machines and animals. Animals are counted here as capital.
Financial: the supply of finance for investments and development. In the majority of
countries there are specialized banks that undertake this responsibility.
We believe this information is enough to understand the problem and the next steps.
Deeper economic analysis can follow in later work.
A.2 Common Agricultural Policy in the EU***
The Common Agricultural Policy (CAP) is the agricultural policy of the EU. It implements a
system of agricultural subsidies and other programs.
The main objectives of the CAP are:
1. To increase productivity and ensure the optimal use of the factors of production
2. To ensure a fair standard of living for the agricultural community
3. To stabilize markets
4. To secure the availability of supplies
5. To provide consumers with food at reasonable prices
The CAP was born in 1962. Several policies have been applied since then. Some of the
most important were:
1970s - Supply management. Farms are so productive that they are producing more
food than is needed. The surpluses are stored and lead to ‘food mountains’. Specific
measures are put in place to align production with market needs.
1992 - The CAP shifts from market support to producer support. Price support is
scaled down and replaced with direct aid payments to farmers. They are encouraged to
be more environmentally friendly.
2000 - The CAP centers on rural development. The CAP puts more focus on the
economic, social and cultural development of rural Europe.
2003 - A CAP reform cuts the link between subsidies and production. Farmers are
more market oriented and, in view of the specific constraints on European agriculture,
they receive an income aid. In exchange, they have to respect strict food safety,
environmental and animal welfare standards.
2011 - A new CAP reform seeks to strengthen the economic and ecological
competitiveness of the agricultural sector, to promote innovation, to combat climate
change and to support employment and growth in rural areas.
* http://ec.europa.eu/agriculture/statistics/agricultural/2013/pdf/c5-1-351_en.pdf
** http://www.gapminder.org/data/
*** http://ec.europa.eu/agriculture/50-years-of-cap/files/history/history_book_lr_en.pdf
II. Analysis
(Descriptive) Multivariate Linear Regression - OLS
The purpose of this section is to describe how the components of agriculture affect
wages. A good analysis here not only gives a clear view of the situation but can also be
a useful tool for policy design. The base year for this analysis is 2007. This choice
rests on two main factors. First, the tables for this year are very complete: the fewer
statistics we have to impute for missing values, the better the model. Second, and most
important, this year is interesting from an economic point of view, as it is when Europe
started to show the first symptoms of the economic crisis.
The countries we use are the EU-27 (those that had joined the Union by 2007):
 1. Austria          7. Estonia    13. Italy        19. Netherlands   25. Sweden
 2. Belgium          8. France     14. Ireland      20. Poland        26. Romania
 3. Bulgaria         9. Finland    15. Latvia       21. Portugal      27. United Kingdom
 4. Cyprus          10. Germany    16. Lithuania    22. Slovenia
 5. Czech Republic  11. Greece     17. Luxembourg   23. Slovakia
 6. Denmark         12. Hungary    18. Malta        24. Spain
Variables Selection****
After reviewing the economic background, summarized briefly in the introduction, we
identified the following variables as important for the analysis. We will then check
goodness of fit and decide which of them to keep.
Crop Production Cereal annual
Crop Production Pulses annual
Root crop Production annual
Raw milk production in 1000 l annual
These variables are enough for our purpose. Production in agriculture can be approximated
by three categories of crops: cereals, pulses and roots, which make up the majority of
production. In addition, we included raw milk production, which we converted from liters
to kilograms for the analysis. Regarding slaughtering, we believe it is important, but we
did not locate a table broken down by species. Since meat production covers many species,
a positive regression coefficient would not let us recommend a specific policy.
Wage-Income (target variable) in millions
Average income is not an ideal variable in this case, because we don’t know the
number of workers in the sector and so cannot match it with the other variables. The
reason, again, is that farms are mostly family units and the workers are often children.
We do have the employment percentages, but these represent only the adults, who probably
own the farm or pay some insurance coverage. We therefore found it necessary to take
another point of view and use the Gross Value of Agriculture: the total quantity
produced multiplied by the basic producer price, less taxes and all intermediate
consumption, plus subsidies. We chose this because we want to know the real value a
producer receives and then use regression to identify the reasons for its variation. We
kept subsidies in the value because Europe already applies this subsidy policy, so they
affect the situation we observe. One may ask how we know that the producer sells all of
this production. The answer is that governments usually absorb the remaining supply.
Major political events, such as the Russian strawberry embargo of 7/8/2014, cannot be
predicted.
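The definition of the target variable can be written as a short function. This is a sketch of the verbal definition above, not the official Eurostat methodology; the function name and all figures are hypothetical.

```python
def gross_value(quantity, producer_price, taxes, intermediate_consumption, subsidies):
    """Gross value as described in the text:
    quantity x basic producer price, less taxes and intermediate
    consumption, plus subsidies. All monetary inputs share one unit.
    """
    return quantity * producer_price - taxes - intermediate_consumption + subsidies


# Hypothetical figures for one country-year (monetary values in millions).
value = gross_value(
    quantity=1000,              # units produced
    producer_price=20,          # price per unit
    taxes=1500,
    intermediate_consumption=8000,
    subsidies=2500,
)
print(value)
```

Keeping subsidies as a separate argument makes it easy to recompute the target without them if a later analysis needs the unsubsidized value.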
Livestock Cattle
Livestock Pig
Livestock sheep
Livestock goats
Agriculture land in %
As mentioned initially, the livestock and land variables belong to landed and
exploitation capital. They also capture, to a satisfactory degree, the slaughtering
variable that we did not include. Agricultural land was converted to square kilometers.
Total R&D in Euros
R&D is useful for two main reasons. First, it is one of the CAP’s targets, especially for
the coming years, and its effect can be very important. Second, as we don’t have a
mechanical equipment variable, we assume that technology is captured by this variable. We
should not forget that technology is always initially a product of research.
% Female agriculture workers
% Male agriculture workers
It is useful to separate women and men and compare the regression results.
Price of diesel oil per 100 l
These prices are significant because fuel is the main cost of agricultural production.
Fuel, and especially diesel, can greatly affect income. As mentioned, income from
agriculture is very sensitive to changes in cost and is small relative to the price that
consumers pay for the products. For example, fruit can pass through three different
traders before appearing on supermarket shelves. Changes in inputs such as fuel can
therefore have a serious impact. We use the price per 100 l to keep it on a scale
comparable with the other prices.
Consumption of manufactured fertilizer Potassium tonnes -> kg
Consumption of manufactured fertilizer Phosphorus tonnes -> kg
Consumption of manufactured fertilizer Nitrogen tonnes -> kg
Fertilizers are very interesting in this case study. They affect wages in two
different ways. First, they may be the most important component of productivity: a
huge share of production comes from their use. Second, their cost is very high for
the producer. The impact therefore comes from opposite directions: positive from usage,
negative from cost. It is thus reasonable to use the square of the values to describe the
diminishing positive effect of these variables. Tonnes were converted to kg because we
want uniform units across our dataset.
Selling price for soft wheat Rice per 100kg
Selling price per 100kg potatoes
Selling prices per 100kg raw milk
Selling price per 100kg cattle
Selling prices per 100kg sheep
Selling price per 100kg pig
Selling prices are, of course, our main interest. In the animal category, sheep and goats
tend to have the same price, so we use one price for both. For pigs and cattle the
differences are large, so we cannot ignore their individual prices. Regarding crops, it
is very hard to find data on the production and the price of each plant. Thus, we use a
few prices as representatives of our categories. We noticed that, apart from small
deviations, prices within a category tend to follow the main price. Also, our purpose is
not to analyze demand across products but annual wages. Potatoes can therefore represent
root production and wheat the general crops.
Direct Taxes on income %of GDP
Expenditure in Education % of GDP
It is very hard to define the taxes that a producer actually pays, as they are split into
direct and indirect taxes. Thus, we use the most visible one, the direct income tax that
everybody pays. In education, we could not find specific numbers for agricultural
expenditure, but we believe total expenditure on education as a percentage of GDP is a
valuable variable for our research. We convert both to absolute values by multiplying the
percentage by GDP.
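The conversion from a share of GDP to an absolute amount is a one-line calculation. The sketch below uses hypothetical figures; the function name is ours and is not part of the original workflow.

```python
def pct_of_gdp_to_euros(pct, gdp_euros):
    """Convert a share of GDP given in percent into an absolute amount."""
    return pct * gdp_euros / 100.0


# Hypothetical example: education expenditure of 5% of a GDP of 200 (billion euros).
print(pct_of_gdp_to_euros(5, 200))
```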
We would also like to mention that we did not include organic production or luxury
products at all. We believe both deserve to be analyzed as special categories. For
example, goji berries are today a well-known superfood with a supermarket price of more
than 100 euros per kilo. Demand here is not inelastic at all; this is a special category
of ‘‘pharmaceutical’’ food, so it is outside our scope. Organic products are likewise a
category driven by health interests and are consumed mainly by wealthier people because
of their higher cost. In summary, this project focuses on basic agricultural products:
those that serve as industry inputs but are also ready for consumption, and that cover
the majority of agricultural GDP.
Furthermore, we will not use GDP, at least in this part, either in total or for the
agriculture sector. After careful consideration, we believe that an indicator such as GDP
is already represented by our variables; it is itself an index of wealth and a summary of
the situation. We want to explore each of these attributes on its own, so we excluded it.
We also want to avoid multicollinearity and redundancy in the equation.
Consumption was also considered very important, but we rejected it. Even though
consumption is the main target of every producer, European Union policies are unlikely to
affect it, and we know that governments simply absorb the rest of production. The main
goal of this project is to identify policy levers that push up wages.
Additionally, the price index of rent and land would have been very useful, but more than
50% of its values were missing. With so few observations it was too risky to impute and
keep it. We also know that most land is used by its family owners, so rents are less
relevant. It could of course be very useful if we had the data: for example, an extremely
high correlation with the rent and land index after regression could point to a subsidy
per hectare. But the risk of inconsistency with the current data is too high.
Finally, in this project we focus only on harvesting and animal production, so fishery
and forestry variables are excluded.
The set of variables could of course be richer, but these were the most important ones
found in the available data, and after research we feel we have captured the most
influential ones.
Variables that we would have liked to have and did not find are:
Industrial equipment in use
Poultry Livestock or Production
Final Variables for part 1 (values converted to consistent units to improve the model)
Category        Variable                                       Name
Production      Crop Production Cereals, per 100 kg harvest    CereCrop
                Crop Production Pulses, per 100 kg harvest     PulsCrop
                Root Crop Production, per 100 kg harvest       RootCrop
                Raw milk Production, 100 kg                    RawMilk
Livestock       Cattle, thousand head                          LVcattle
                Pigs, thousand head                            LVpig
                Goats, thousand head                           LVgoat
                Sheep, thousand head                           LVsheep
Income          Gross Value in millions                        Inc
Stock           Agriculture land, km^2                         ALand
R&D             Total R&D in Euros                             RD
Fertilizers     Potassium, per 100 kg, squared                 FPot
                Phosphorus, in kg, squared                     Fphos
                Nitrogen, kg, squared                          Fnitro
Selling Prices  Soft Wheat, per 100 kg                         PrWheat
                Potatoes, per 100 kg                           PrPotat
                Cattle, per 100 kg                             PrCattle
                Sheep, per 100 kg                              PrSheep
                Pig, per 100 kg                                PrPig
                Raw Milk, per 100 kg                           PrMilk
Others          % Female agriculture workers                   FemalePr
                % Male agriculture workers                     MalePr
                Price of diesel oil per 100 l                  Prdiesel
                Taxes in Euros                                 Tax
                Education Expenditure in Euros                 Educ
                Agriculture land in km^2                       Land
**** http://www.gapminder.org
http://ec.europa.eu/eurostat
http://data.worldbank.org
1. Dealing with missing values
For this section SAS Miner is used. The data sets are mainly from Eurostat, but they
contain some missing values that we have to deal with. This is a very important step: we
do not have many observations, so the treatment of each missing value can be crucial.
There are several methods for replacing values. One of them is manual replacement, based
mainly on the user’s experience. We use this method because, even though we analyze only
2007, we have several tables from other years that help us estimate the 2007 value. For
variables with no values at all for a country, we replace manually using similar
observations: for countries without data we find similar countries and adopt their
values. For this last step the main comparison tool is agricultural GDP. After all
changes, we check the statistics again. Large changes, especially in the standard
deviation and mean, probably indicate a mistake and added uncertainty.
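The replacement in this project was done manually and by judgment. As an illustration only, one simple rule that uses the neighbouring years could look like the sketch below. Linear interpolation and the fallback to the nearest known year are our assumptions, and the series values are hypothetical.

```python
def impute_from_neighbours(series, target_year):
    """Estimate a missing yearly value from the nearest known years.

    `series` maps year -> value, with None for missing entries.
    Assumed rule: interpolate linearly between the closest earlier and
    later known years; if only one side exists, carry that value over.
    """
    if series.get(target_year) is not None:
        return series[target_year]
    known = sorted(year for year, value in series.items() if value is not None)
    earlier = [year for year in known if year < target_year]
    later = [year for year in known if year > target_year]
    if earlier and later:
        y0, y1 = earlier[-1], later[0]
        v0, v1 = series[y0], series[y1]
        return v0 + (v1 - v0) * (target_year - y0) / (y1 - y0)
    if earlier:
        return series[earlier[-1]]
    if later:
        return series[later[0]]
    return None  # nothing known: fall back to a similar country by hand


# Hypothetical series with the 2007 value missing.
print(impute_from_neighbours({2006: 100.0, 2007: None, 2008: 110.0}, 2007))
```

When the function returns None, no year is available for that country, which corresponds to the case handled in the text by borrowing the value of a similar country.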
Missing Values
Statistics –with missing values
The table above gives a first view of our variables: the mean, standard deviation,
skewness and number of missing values.
Statistics-without missing values
Comparison Statistics Table
            Mean                       Standard Deviation
Variable    NonMissing    Missing      NonMissing    Missing
CereCrop    2116.87       2116.87      2658.79       2658.79
Educ        23876.82      24429.11     35474.34      36058.29
FemalePr    5.248         5.9          6.48          6.8
Fnitro      4279093       4556125      5723445       6023222
Fphos       549074.1      557000       741330.7      785764.9
Fpot        1234444       1230292      1708561       1799359
Inc         5736.35       8507.17      5736.35       8507.17
LVpig       5909.981      5909.98      7731.72       7731.72
LVsheep     3522.884      3653.82      6318.57       6406.23
Land        72867.96      72867.96     92117.39      92117.39
LVcattle    3312.278      3312.78      4526.89       4526.89
LVgoat      494.24        570.13       1074.79       1150.76
MalePr      7.38          7.36         5.45          5.69
Prcattle    222.35        211.89       75.43         77.12
PrMilk      31.86         32.01        4.41          4.14
PrPig       121.44        115.28       39.43         32.76
PrPotat     24.88         24.88        9.18          9.18
Prsheep     115.99        122.01       114.69        120.26
PrWheat     19.14         19.14        3.56          3.56
Prdiesel    95.10         95.11        21.05         21.05
PulsCrop    47.58         47.58        72.86         72.86
RD          8490.17       8490.17      14736.32      14736.32
Rawmilk     5299.44       58350.77     72092.21      72711.88
RootCrop    154.3         154.3        215.10        215.10
Taxes       63178.63      63178.63     96843.31      96843.31
We can see that the basic statistics (mean and standard deviation) remain nearly the same
for most variables. We take this as evidence that the replacement of missing values was
successful, so we can continue with the analysis.
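The comparison can also be done mechanically by computing the relative change of each mean. The sketch below uses a few rows of the comparison table as printed; the 10% tolerance is an assumption of ours, not a rule used in the project.

```python
# (NonMissing mean, Missing mean) for a few rows of the comparison table.
MEANS = {
    "CereCrop": (2116.87, 2116.87),
    "Educ": (23876.82, 24429.11),
    "LVgoat": (494.24, 570.13),
    "PrMilk": (31.86, 32.01),
    "Inc": (5736.35, 8507.17),
}


def relative_change(first, second):
    """Absolute change of `second` relative to `first`."""
    return abs(second - first) / abs(first)


def flag_large_shifts(means, tolerance=0.10):
    """Return the variables whose mean moved by more than `tolerance`."""
    return [
        name
        for name, (first, second) in means.items()
        if relative_change(first, second) > tolerance
    ]


# Variables listed here deserve a second look before moving on.
print(flag_large_shifts(MEANS))
```

Any variable flagged by such a check is worth rechecking against the source tables, since a large shift in the mean may come from a transcription error rather than from the replacement itself.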
2. Estimation Method
Before we continue to the main section, we should introduce our method of analysis. We
want to identify a linear relationship that connects all the important components of
agricultural income. We have cross-sectional data with no special disturbances, so we
are comfortable expecting that this is feasible. As the estimation method we choose OLS.
If all the OLS assumptions hold, the estimator is BLUE (best linear unbiased estimator)
and, with normal errors, also the best unbiased estimator among all estimators (not
necessarily linear), that is, the one with the smallest variance.
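To make the estimator concrete, the sketch below computes the OLS intercept and slope for a single predictor from the closed-form formulas. The project itself estimates a multivariate model in SAS; the one-predictor case and the data here are simplifications chosen for illustration.

```python
def ols_simple(x, y):
    """OLS fit of y = intercept + slope * x for one predictor.

    slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    intercept = mean_y - slope * mean_x
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
intercept, slope = ols_simple([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)
```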
3. Model building
Now that we are done with missing values we can continue to the main analysis. To
repeat, the purpose of this project is to identify the best linear relationship between
agricultural income and a set of predictors. We cannot, of course, keep all of the
variables that we introduced. We introduced them to keep the model flexible, in case one
of them is suddenly needed. Also, we do not know the explanatory power of each variable
at the outset, so we should be free to choose the best among them.
The basis for this part will be the OLS regression assumptions, which briefly are:
1. linear in parameters
2. no perfect collinearity
3. zero conditional mean
4. homoscedasticity
5. uncorrelated errors
6. normality of errors
Under assumptions 1-5 the estimator is BLUE; with assumption 6 it is BLUE and equal to
the ML estimator.
Correlation matrix (drop inaccurate & non-meaningful variables)
We start with the second assumption, because it is useful to drop “bad” variables from
the model and make it more effective. Collinearity can arise in two ways: first, through
high correlation between pairs of independent variables; second, through a near-linear
relationship among several independent variables. This is a problem when we try to
explain the relation of each independent variable with the dependent variable, because it
makes the individual effects hard to separate. In this part, we focus on the
relationships between independent variables by creating the Pearson correlation matrix in
SAS Miner.
Pearson Correlation Matrix
Generally, the threshold above which a correlation is considered problematic is 0.8. In
the table above several values exceed this level. Specifically, the problem comes mainly
from the production quantities, education, agricultural land and research. Prices are
uncorrelated to a satisfactory degree. Fertilizers have some small issues. In regression
analysis this usually happens when there are many variables measuring the same thing.
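The screening step can be sketched in a few lines: compute the Pearson correlation for each pair of predictors and list the pairs at or above the 0.8 threshold. The matrix in this project was produced in SAS Miner; the variable names and values below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def high_correlation_pairs(data, threshold=0.8):
    """List variable pairs whose absolute correlation reaches the threshold."""
    names = list(data)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(data[first], data[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


# Hypothetical observations for three predictors.
sample = {
    "cattle": [1, 2, 3, 4, 5],
    "milk": [2, 4, 6, 8, 11],
    "diesel": [5, 3, 6, 2, 4],
}
print(high_correlation_pairs(sample))
```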
Production
Let us skip Land for the moment, as it will ultimately be excluded from the model, and
focus only on the relevant variables. We see high positive correlations: when the
production of cereal crops increases, the other production variables increase as well.
This is a very useful conclusion for deciding which method to use. One solution would be
to use the sum of all the variables, but this is not possible because each product has a
different price, so no common price can be matched to the sum. We therefore keep the
cereal crop and, because of the high correlation, assume that the others have the same
effect.
Taxes
Taxes is one of the variables that we reject. It is highly correlated with many of the
others and, after consideration, it is not a meaningful variable. When we choose
variables for regression we should consider, besides the statistics, their relevance as
researchers. Suppose the tax estimator shows a negative relationship with income. Then
the EU should presumably give a subsidy as a percentage of total taxes. This is not
feasible: taxes are the government’s “income”, so the subsidy would help the government
and not the worker. Moreover, if taxes were subsidized, a government could easily raise
them just to take advantage of the situation.
Conversely, suppose the estimator shows a positive relationship. This would mean that the
more taxes you pay, the higher your income, probably an effect of taxes being levied per
amount of production. Even so, this is an issue for individual governments rather than
for the EU.
Livestock
We see a very high correlation between cattle livestock and several variables. It is very
hard to identify the reasons for this, especially for someone who is not a participant in
the business. For example, there is a 0.91 correlation between education and cattle
livestock, which has no obvious explanation. On the other hand, there are also some
obvious effects, such as the 0.93 correlation between cattle and raw milk: the more
cattle you have, the more milk you produce. The positive correlation with taxes is
probably the effect of some restrictive policy. The European Union has often pointed to
the environmental protection policy it wants to apply, and we know that production is one
major component.
As for sheep and goats, we did not notice any particularly high correlation, but we know
that they do not have a large share in the European agricultural economy. So we keep only
one of them, goats, which are less correlated with the other variables, and give more
room to pigs, whose production has a huge impact in total terms.
Finally, as with crops, the different prices make it impossible to combine these
variables. We keep goats and pigs.
Fertilizers
Here the situation is different. We do not have price indexes, so we can try the sum,
possibly followed by a log transformation to fit the values to the model.
The high correlation with taxes again likely comes from some environmental policy.
Agriculture employment
We obviously cannot keep both of these variables, as they are mutually exclusive and thus
collinear. If we wanted to, we could continue with males, who are the main workers in
agriculture. But, again, the main purpose here is not to explore male or female
components; it is to identify support policy. So, at least for the basic linear
regression, we exclude both.
Education and Research
Education and research are highly correlated. This is expected because one affects the
other: the more education there is, the more research results. People with more
education, such as master’s degrees and PhDs, are probably more willing to discover new
patterns that make the economy more productive. But we also considered that higher
education is not a highly important component in agriculture. The skills this work
requires are specific, such as feeding the animals and harvesting the land. It is not a
business where holding a master’s degree makes you a director with a rising wage.
Therefore, we believe that research, meaning approaches and technologies that improve
production, is the key point. For this reason, we decided to use only Research.
Land
After consideration we decided to exclude land from the analysis. First, it is highly
correlated with important variables. This is expected, as we know that the shortage of
land is one of the major components of poverty. But let us assume that we solve this
problem and include land in the regression. If the coefficient is high and positive, then
wages increase as land increases. This project aims to identify good policies to support
agricultural income, so in this case the EU would have to find a way to increase the
share of agricultural land. This is impossible for several reasons. 1) Land is not easily
available; it is also used for housing, manufacturing, forests and museums. 2) It is
already allocated, and property rights are respected. 3) Workers cannot move easily.
Agricultural workers usually work in a specific location, where they have their land,
farms and so on, so it is very hard to give them land to work in another place.
Conversely, a negative coefficient would point to a demand for harvesting fallow land. We
believe that the coefficient will be positive, but we should check this in practice
before dropping the variable. We will run a first regression with the variables remaining
after this section and examine the effect of land. If it is small and positive, there is
no reason to keep it given the high correlation.
Regression results
We will not go deeper into the choice of variables for the moment; we do that in the
next section. For now we focus only on Land. As expected, the estimated coefficient is
positive, 0.0774: if the land area increases by 1 km^2, Income (gross value, in millions)
increases by about 0.0774, holding the other variables constant. To be more precise, we
also include the test of significance.
Land histogram
The figure below shows an approximately normal distribution with a small right skew. Let
us test the hypothesis Ho: b3 = 0 at the 5% level of significance.
Ho: b3 = 0 tcr = 1.77093
H1: b3 ≠ 0 t = 3.55
Since |t| > tcr, Ho is rejected and b3 is statistically significant. In simple terms, our
b3 estimate can be generalized to the population.
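The decision rule used above can be written down directly. In the sketch below the critical value and the t statistic for Land are the ones reported in the text; the helper names and the numbers in the `t_statistic` example are hypothetical.

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """t statistic for testing Ho: coefficient = null_value."""
    return (estimate - null_value) / std_error


def reject_null(t_stat, t_critical):
    """Reject Ho when the absolute t statistic exceeds the critical value."""
    return abs(t_stat) > t_critical


# Values reported for Land: t = 3.55, tcr = 1.77093.
print(reject_null(3.55, 1.77093))

# Hypothetical coefficient and standard error, for illustration only.
print(t_statistic(0.5, 0.2))
```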
However, given that the variable is highly correlated with other variables and that the
coefficient is small and positive as expected, we feel it is appropriate to exclude it.
Final variables table
Variable                  Name
Cereal production         CereCrop
Livestock pig             LVpig
Livestock goat            LVgoat
Milk price                Prmilk
Price pig                 Prpig
Price goat                Prgoat
Price wheat               Prwheat
Price diesel              Prdiesel
Raw milk                  Rawmilk
Research & Development    RD
Education                 Educ
Total fertilizers         Fert
Before continuing, we note why we did not use transformations such as logs or squares to
solve the correlation problem: we wanted to drop some variables. At this point we had 20
variables to explain the problem, and keeping all of them would make the analysis
confusing. So we used the correlation check not only to clean the model but also to
identify the appropriate variables to use. Some, such as Research & Development and
Education, we keep for research reasons even though they are highly correlated. For
example, the exclusion of crop production is something that the numbers allow but our
knowledge of the topic does not. So we keep such variables and try to address the problem
with transformations, for better goodness of fit.
Scatterplot Matrix (Predicted vs. final predictors)
At this point we produced a scatterplot matrix of the variables that we finally used. We
want to identify any wrong values, probably outliers, that would make our model lose
explanatory ability. Fortunately, the green columns show no extremely concerning cases.
We should keep in mind that we do not have many observations, so excluding a value would
have a huge impact; it would also be a problem for any validation tests.
We did not identify or exclude any outliers.
4. Model building
In this part we start to build our model. This means trying different relationships and
functional forms of the variables to decide which are best. Note that we are trying to
choose the better model. We probably cannot avoid heteroscedasticity with transformations
alone, but we can build a very good model, with correct signs on the estimators, a
satisfactory R², small errors and good generalization ability. We can then improve it by
adding or dropping variables and by correcting heteroscedasticity with suitable methods.
Because this part can be really chaotic, in the sense that hundreds of combinations could
be tried, we should consider carefully which question we want this research to answer. To
keep things under control, we start by thinking seriously about what we want to learn
from the model, and then move on to more technical aspects such as R², errors, normality
and the other linear regression assumptions.
When building a model we take two things into consideration before going deeper into the
analysis:
The dependent variable should be continuous and normally distributed; ideally, so should
the independent variables.
The researcher’s experience with the topic.
Consideration Procedure
In this model we have to deal with two issues. First, the majority of the histograms of
our variables are right-skewed. Log transformations are appropriate for this problem.
Second, as researchers we anticipate that there are some patterns that we should follow.
Because the second consideration is just a guess, we give priority to the mathematical
assumptions of regression. Ultimately, we cannot avoid several trials before we identify
the correct model (the trials are in the Appendix of the project). A test of significance
can then tell us whether our model generalizes well.
Final Model after histograms based transformation
Final linear Relationship
logIncome=β0+β1*logCerecrop +β2*logPrWheat+β3*logRawmilk+β4*Prmilk^3+
β5*logLVgoat+ β6*logPrgoat+β7*sqrtLVpig+β8*sqrtPrpig+β9*logRD+
β10*sqrtPrdiesel+β11*logFert
Transformations
Variable          Transformation
CereCrop          Log
Prwheat           Log
Rawmilk           Log
Prmilk            X^3
LVgoat            Log
LVpig             Sqrt
Prgoat            Log
Prpig             Sqrt
RD                Log
Prdiesel          Sqrt
Fert              Log
Income (target)   Log
All the transformations are based on:
Log transformation for strong right skewness
Square root for weaker right skewness
Cube for left skewness
For all the variables we tried to achieve normality, especially for the dependent
variable, in case it is needed for some tests in the next steps.
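The rule of thumb above can be sketched as a small helper that measures skewness and suggests a transformation. In the project the choice was made by inspecting histograms; the cut-offs of 1.0 and 0.5 used here are assumptions of ours, and the sample values are hypothetical.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2^1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def choose_transform(values, strong=1.0, mild=0.5):
    """Suggest a transformation; `strong` and `mild` are assumed cut-offs."""
    s = skewness(values)
    if s > strong:
        return "log"    # strong right skewness
    if s > mild:
        return "sqrt"   # weaker right skewness
    if s < -mild:
        return "cube"   # left skewness
    return "none"


def apply_transform(values, name):
    """Apply the named transformation (log and sqrt need positive values)."""
    if name == "log":
        return [math.log(v) for v in values]
    if name == "sqrt":
        return [math.sqrt(v) for v in values]
    if name == "cube":
        return [v ** 3 for v in values]
    return list(values)


right_skewed = [1, 1, 2, 2, 3, 50]   # hypothetical, long right tail
print(choose_transform(right_skewed))
```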
Histograms
Now we should check whether any of the linear OLS regression assumptions are violated.
Linear regression Assumptions
1. Linearity
We should ensure that the dependent variable is modeled as a linear combination of the
independent variables. There is no straightforward test for this, but we can check it in
terms of misspecification: no nonlinear function of the independent variables should be
significant when added to the model. For this purpose we use the Ramsey RESET test. The
idea behind this test is to check how significant the estimators of the quadratic and
higher-order terms are for the model.
Two regressions are estimated, where the second is a version of the first augmented with
powers of the fitted values obtained from the first regression. The squared fitted
values introduce the non-linearity into the specification.
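$\mathrm{Income} = \gamma_0 + \gamma_1\,\widehat{\mathrm{Income}} + \gamma_2\,\widehat{\mathrm{Income}}^{2} + \gamma_3\,\widehat{\mathrm{Income}}^{3}$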
Sas output Ramsey Test
We test the functional form with a t-test on γ2:
The null hypothesis is that the correct specification is linear.
The alternative hypothesis is that the correct specification is non-linear.
Hypothesis:
Ho: γ2=0 tcr=2.1603
Η1: γ2≠0 |t|=0.07
Since tcr > |t|, we fail to reject the null hypothesis: the estimator is insignificant and
we accept the linearity of the first model.
2. No perfect Collinearity
Moderate multicollinearity may not be problematic. Severe multicollinearity, however,
increases the variance of the coefficient estimates and makes the estimates
sensitive to small changes. The coefficient estimates become unstable and difficult to interpret.
Multicollinearity saps the power of the analysis, can cause the coefficients to switch
signs, and makes it more difficult to specify the correct model.
For the purpose of this section we use the variance inflation factor (VIF), which indicates
the extent to which multicollinearity is present in a regression. It measures how much
the variance of a regression coefficient is inflated compared to the case where the
predictors are not linearly correlated. A VIF of 5 or greater is a reason for concern.
We can see that Rawmilk, RD and Fert are highly correlated with the other predictors, with VIFs of 18.36, 11.73 and 6.36 respectively.
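For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below illustrates that definition in plain Python; it is not the SAS computation used in the project, the helper names are hypothetical, and the data are made up so that the first two columns are nearly collinear.

```python
def ols_r2(y, rows):
    """R^2 of regressing y on the given rows (intercept added) via normal equations."""
    n = len(y)
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Augmented normal-equation matrix [X'X | X'y]
    a = [[sum(design[i][p] * design[i][q] for i in range(n)) for q in range(k)]
         + [sum(design[i][p] * y[i] for i in range(n))] for p in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    beta = [a[c][k] / a[c][c] for c in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def vif(rows):
    """Variance inflation factor of every column: 1 / (1 - R_j^2)."""
    n_cols = len(rows[0])
    result = []
    for j in range(n_cols):
        target = [row[j] for row in rows]
        others = [[row[i] for i in range(n_cols) if i != j] for row in rows]
        result.append(1.0 / (1.0 - ols_r2(target, others)))
    return result

# Made-up data: column 1 is roughly twice column 0, column 2 is unrelated
data = [[1, 2.1, 4], [2, 3.9, 1], [3, 6.2, 3],
        [4, 7.8, 3], [5, 10.1, 1], [6, 12.0, 4]]
for name, value in zip(["x0", "x1", "x2"], vif(data)):
    print(name, "VIF above 5" if value > 5 else "VIF below 5")
```

With the rule of thumb used here, the two nearly collinear columns are flagged and the unrelated column is not.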
We recall the correlation matrix to check the relationships again.
Correlation Matrix
Solution choices:
Remove highly correlated predictors
Combine variables with ratios
Run a different regression if nothing works
Standardize the predictors
In our case we cannot combine variables, because the correlation is probably spurious and the variables are not related. For example, Research is highly correlated with
Rawmilk. A possible explanation is that strong research could develop ways to exploit all the ingredients of milk without wasting anything. For example, when
you make cheese you discard the leftover 'water' (whey), which with some techniques can
actually be made into soft cheese. This is a possible explanation, but we still regard the
relationship as spurious.
Consideration procedure
When we deal with a regression we should always remind ourselves of the background and the question that we want to answer. In this project we want to design a policy that supports
higher agricultural income. In the case of Rawmilk, a large positive estimate would probably mean that production supports income, so we should press
for more production. But milk produced on farms usually goes through heavy industrial
processing to take its final form: whole milk, cheese, skim milk, yoghurt. So we cannot control this relationship as easily as with harvested products or
animals; we cannot press industries to produce amounts they do not need. On the other hand, Research is another important variable that shows the need for development and
new ways of production, which we believe is very significant in the 21st century given
strong competition. So we keep that variable.
Regarding fertilizers, we tried to create a new variable, fertilizers used per unit of crop
production, but unfortunately this gave a high VIF for CereCrop. So, instead, we simply
exclude Rawmilk.
After running the new regression we get:
The VIF of Fert is still a little high, but we can ignore it.
Note: we found the standardized approach interesting, but chose not to follow it as it is
not well known. * **
* https://www3.nd.edu/~rwilliam/stats1/x92.pdf
** http://blog.minitab.com/blog/adventures-in-statistics/what-are-the-effects-of-multicollinearity-and-when-can-i-ignore-them
We quickly check the linearity assumption again with the Ramsey RESET test to ensure that everything remains the same.
Hypothesis:
Ho: γ2=0 tcr=2.1788
Η1: γ2≠0 |t|=0.28
Since tcr > |t|, we fail to reject the null hypothesis: the estimator is insignificant and
we accept the linearity of the model.
3. Zero conditional Mean
This assumption is required for the estimators to be unbiased. The error term has zero conditional mean, meaning that the average error is zero at any specific
value of the independent variables. Simply put, the error does not depend linearly or
nonlinearly on x. This is perhaps the most serious assumption for cross-sectional data, but the problem is that there is no way to test it directly. Violation of this assumption means
that there is a systematic error relative to the real population when we collect the data, which can make our estimators biased and our model unable to predict. Because OLS is based
on this assumption, we simply have to accept that it holds.
In simple terms:
$E(u \mid x_1, x_2, \dots, x_k) = 0$
4. Normality of error
This assumption is very important in our case: we do not have many observations, so
we cannot rely on the central limit theorem. Non-normality of the errors affects
the exact p-values of the tests on the coefficients. But if the distribution is not too grossly non-normal, the tests will still provide good approximations.
Because we do not have many observations, we use the Shapiro-Wilk test, together with
the Q-Q plot for a visual check of the results.
Shapiro-Wilk test:
The basic idea behind the Shapiro-Wilk test is to estimate the variance of the sample in
two ways: (1) the regression line in the Q-Q plot allows us to estimate the variance, and (2) the variance of the sample can also be regarded as an estimator of the population
variance. Both estimates should be approximately equal in the case of a normal distribution.
We want to check whether $r \sim N(\mu_0, \sigma^2)$
Hypothesis:
Ho: Wo>Wa, the residuals follow a normal distribution
H1: Wo<Wa, the residuals do not follow a normal distribution
p-value = 0.585 > 0.05, so we cannot reject the null hypothesis and we can assume normality.
QQ-plot
The Q-Q plot shows quite clearly that the distribution is normal. Even though the points do not
form a perfect line, we cannot identify any skewness, heavy or light tails, or a bimodal distribution. We accept normality.
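As a rough illustration of what the Q-Q plot compares, the sketch below pairs sorted residuals with standard normal quantiles and measures how straight the resulting line is. It is not the SAS procedure used here: the residuals are made up, the function names are hypothetical, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
import math
from statistics import NormalDist

def qq_points(residuals):
    """Pair each sorted residual with the matching standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def qq_correlation(residuals):
    """Pearson correlation of the Q-Q points; values near 1 suggest normality."""
    points = qq_points(residuals)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up residuals, roughly symmetric around zero
example_residuals = [0.2, -1.2, 0.7, -0.3, 1.1, -0.6, 0.0]
print(round(qq_correlation(example_residuals), 3))
```

A correlation close to 1 corresponds to points lying near a straight line in the plot; skewness or heavy tails bend the line and lower the value.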
Distribution of residuals
We can also see in the distribution plot that the residuals have a very slight left skewness, which we can safely ignore. Normality is accepted again.
5. Homoscedasticity
The assumption of homoscedasticity (constant variance of the residuals) is central. It describes the situation where the variance of the error term is the same across all values of the
independent variables. The violation of this assumption is called heteroscedasticity and
is a serious problem. In simple terms, the OLS estimator minimizes the squared errors giving equal weight
to all observations; if heteroscedasticity is present, this minimum error no longer comes equally from all observations, so it is very hard to identify where the error comes from. Ideally
the variance of the errors should be constant and equal for all observations. For this section we use a plot (residuals versus predicted values) for a
first look, and then the Breusch-Pagan test and the White test (XLSTAT package).
What we want: $\mathrm{Var}(u_t) = \sigma^2$
Residuals vs fitted (predicted) values
We see that the pattern of the data points gets a little narrower towards the left
end, which is an indication of mild heteroscedasticity. We cannot identify a specific circular or cone-shaped pattern. It seems a little thinner in the middle, but the general picture is
that the points are spread over the whole diagram. We will also conduct some tests.
White test: it creates the squares of all independent variables and all their cross products, and runs a regression of the squared residuals on them. The problem with the White test is that it can reject the null
hypothesis (homoscedasticity) not only because of non-constant variance but also because of misspecification, especially when the regression includes many variables. So
we also check the Breusch-Pagan test.
Breusch-Pagan: it tries to identify linear forms of heteroscedasticity. White's test is actually
a special, more relaxed case of it.
In the tables above we can see that both our tests fail to reject the null hypothesis, meaning
that our model is homoscedastic. So we can continue with the other assumptions.
If homoscedasticity were violated, we would use one of the following methods:
Weighted least squares regression
Generalized linear regression
Pagan Test:
Run the auxiliary regression:
$\hat{\varepsilon}_i^{2} = \gamma_0 + \gamma_1\,\widehat{\mathrm{Income}}$
Hypothesis:
Ho: Var(εi)=σ^2, homoscedasticity
H1: Var(εi)=σi^2, for at least one i of the residuals
χcr=19.6751 χ-val=nR^2=12.744
χcr > χ-val, so we do not reject Ho and we accept homoscedasticity.
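The decision rule can be written out in a few lines. This sketch only restates the comparison of nR^2 with the chi-square critical value; the critical value and the reported statistic are taken from the text, while the function names and the example inputs n = 27 and R^2 = 0.472 are hypothetical, chosen only so that their product reproduces the reported 12.744.

```python
def lm_statistic(n_obs, r2_aux):
    """Breusch-Pagan LM statistic: sample size times R^2 of the auxiliary regression."""
    return n_obs * r2_aux

def rejects_homoscedasticity(lm, chi2_critical):
    """Reject Ho (constant variance) only when the statistic exceeds the critical value."""
    return lm > chi2_critical

CHI2_CRITICAL = 19.6751   # critical value reported in the text
LM_REPORTED = 12.744      # n * R^2 reported in the text

print(rejects_homoscedasticity(LM_REPORTED, CHI2_CRITICAL))  # False: Ho is not rejected

# Hypothetical inputs, not taken from the SAS output
print(round(lm_statistic(27, 0.472), 3))
```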
Output for Pagan test:
White Test:
Our initial linear equation:
logIncome = β0 + β1*logCereCrop + β2*logPrwheat + β3*logRawmilk + β4*Prmilk^3 +
β5*logLVgoat + β6*logPrgoat + β7*sqrtLVpig + β8*sqrtPrpig + β9*logRD +
β10*sqrtPrdiesel + β11*logFert
Auxiliary regression:
$\hat{\varepsilon}_i^{2}$ = (all variables of the logIncome equation) + (all squared variables) + (all cross products)
Hypothesis:
Ho: σi=σ, for all i=1,…,n
H1: σi≠σ, for at least one i=1,…,n
p-value=1, α=0.05
p-value > α, so we do not reject Ho and homoscedasticity holds.
Output White test:
6. Uncorrelated errors (no Autocorrelation)
Autocorrelation is mainly a problem in time series data, where it comes from systematic errors in measurement or from misspecification. For cross-sectional data there are several opinions.
The first is that the observations are i.i.d., therefore the errors are i.i.d. and we have no issue. The second is that autocorrelation can arise from misspecification, usually
as spatial correlation. For example, in our case the livestock of goats may be related to the
livestock of pigs within each country, and then their errors would not be independent. Because
we want to be as sure as possible, we assume that the second opinion is the more
correct one. All the assumptions of regression relate to misspecification. So, arguing
by contradiction, if all the other assumptions hold, then the model is not
misspecified and the assumption of uncorrelated errors holds as well.
In general we want: $\mathrm{Cor}(u_i, u_j) = 0$, for i≠j
Consequences if the assumption does not hold:
The OLS estimates are still unbiased and consistent.
OLS is inefficient, so it is no longer BLUE.
The estimated variances of the regression coefficients will be biased and
inconsistent, and therefore hypothesis testing is no longer valid. In most
cases, the R^2 will be overestimated and the t-statistics will tend to be higher.
7. Residual Sas output plots
1.IV. Interpretation of results
Having checked all the assumptions and found that the model we recommend is correctly
specified (meaning that all the OLS assumptions are satisfied, so we are assured of the
quality of the estimation), we can move on to interpretation. We want to go deeper into the
model: check the overall statistics (R-squared, overall F-test and mean square error),
analyse the estimators' relationship with the dependent variable, and check their
significance. If an estimator is not significant, meaning we fail to reject the null
hypothesis, we can drop it from the model if we choose.
Sas Output
R^2
The R^2, or coefficient of determination, is a number that indicates the proportion of
variance explained by a regression model; in simple terms, how well our model fits
the data. The higher this number the better (except exactly zero or one, the first meaning
no explanatory power and the second suggesting multicollinearity). In our case R^2 is 97.2%, which is a
very satisfying fit. Of course R^2 alone is not enough to establish a good model, which
is why we first did the analysis above.
R^2 adjusted
Adjusted R^2, in contrast with R^2, accounts only for the variation explained by those
independent variables that actually affect the dependent variable. Whereas R^2
can only increase as explanatory variables are added, adjusted R^2 can decrease
when a predictor improves the model less than would be expected by chance. As in the first case, the
higher this value the better (excluding zero and one). In our case it is 0.96 (96%), which is again
a great number.
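The usual formula for the adjustment is 1 - (1 - R^2)(n - 1)/(n - k - 1). The sketch below applies it to the reported R^2 of 0.972. The sample size n = 27 and the number of predictors k = 10 are assumptions: they are not stated in the text, but they are consistent with the 16 residual degrees of freedom implied by the critical value tcr = 2.119905 used in the significance tests.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2: penalizes R^2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

R2_REPORTED = 0.972   # from the SAS output discussed above
N_OBS = 27            # assumption, not stated in the report
N_PREDICTORS = 10     # assumption: 11 regressors minus the dropped Rawmilk

print(round(adjusted_r2(R2_REPORTED, N_OBS, N_PREDICTORS), 4))
```

Under these assumptions the result is about 0.95, in line with the adjusted value reported above once rounding of R^2 is taken into account.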
Overall F-test
The F-test evaluates the null hypothesis that all regression coefficients are equal to zero
versus the alternative that at least one is not. A significant F-test indicates that the
observed R-squared is reliable, and is not a spurious result of oddities in the data set.
Thus, the F-test determines whether the proposed relationship between the response
variable and the set of predictors is statistically reliable.
Hypothesis:
H0: β1 = β2 = ... = βp-1 = 0 (the intercept-only model fits as well as ours)
H1: βj ≠ 0, for at least one value of j
F = [(RSS_H − RSS)/(p − 1)] / [RSS/(n − p)] ~ F(p−1, n−p),
where RSS_H is the residual sum of squares of the intercept-only model.
F value=56.65
Fcr=2.494291
Fv > Fcr, so we reject the null hypothesis, which means that our model provides a
better fit than the intercept-only model.
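Because R^2 = 1 − RSS/RSS_H, the same statistic can be computed from R^2 alone. The sketch below does this as a plausibility check. The values n = 27 and 10 predictors are assumptions inferred from the reported critical values, not figures stated in the text, and the function name is hypothetical.

```python
def overall_f(r2, n_obs, n_predictors):
    """Overall F statistic written in terms of R^2 instead of residual sums of squares."""
    explained_per_df = r2 / n_predictors
    unexplained_per_df = (1 - r2) / (n_obs - n_predictors - 1)
    return explained_per_df / unexplained_per_df

F_CRITICAL = 2.494291   # critical value reported in the text

# n = 27 and 10 predictors are assumptions, see the lead-in
f_value = overall_f(0.972, 27, 10)
print(round(f_value, 2), f_value > F_CRITICAL)
```

With the rounded R^2 of 0.972 this gives roughly 55.5, close to the 56.65 reported by SAS; the small gap is plausibly due to rounding of R^2. Either way the statistic is far above the critical value.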
Root MSE
The RMSE is the square root of the variance of the residuals. It indicates the absolute fit
of the model to the data–how close the observed data points are to the model’s predicted
values. We can imagine it as the standard deviation of the errors. Lower values of RMSE
indicate better fit. RMSE is a good measure of how accurately the model predicts the
response variable. Which measure is best is up to the researcher. In our case it is
0.15179, which is a good number and we can accept it.
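As a small illustration of the definition, the root MSE is the square root of the residual sum of squares divided by the residual degrees of freedom. The residuals below are made up and the function name is hypothetical; this is not a recomputation of the 0.15179 reported by SAS.

```python
import math

def root_mse(residuals, n_params):
    """Square root of SSE / (n - p), where p counts all estimated parameters."""
    sse = sum(e * e for e in residuals)
    return math.sqrt(sse / (len(residuals) - n_params))

# Made-up residuals from a model with 2 estimated parameters
print(root_mse([1.0, -1.0, 1.0, -1.0], 2))
```

Because Income is log-transformed in the final model, a Root MSE on this scale is measured in log units rather than in the original income units.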
Tests of significance
After the general idea of fit we now focus on the individual estimators. In this part
we check which of our estimators generalize to the population as well.
CereCrop
Hypothesis:
Ho: b2 =0 tcr = 2.119905
H1: b2 ≠0 t=2.4
|t| > tcr, so Ho is rejected and b2 is significant. In simple terms, our b2 estimator
can be generalized to the population.
PrWheat
Hypothesis:
Ho: b3 =0 tcr = 2.119905
H1: b3 ≠0 t=0.59
|t| < tcr, so Ho is not rejected. In simple terms, the b3 estimator has little power to
explain the Income of agriculture.
RD
Hypothesis:
Ho: b4 =0 tcr = 2.119905
H1: b4 ≠0 t=5.67
|t| > tcr, so Ho is rejected and b4 is significant.
PrMilk
Hypothesis:
Ho: b5 =0 tcr = 2.12
H1: b5 ≠0 t= -1.16
|t| < tcr, so Ho is not rejected.
Lvgoat
Hypothesis:
Ho: b6 =0 tcr = 2.12
H1: b6≠0 t= -5.23
|t| > tcr, so Ho is rejected and the b6 estimator is significant.
Prgoat
Hypothesis:
Ho: b7 =0 tcr = 2.12
H1: b7≠0 t= 1.84
|t| < tcr, so Ho is not rejected.
LVpig
Hypothesis:
Ho: b8 =0 tcr = 2.12
H1: b8≠0 t= 1.10
|t| < tcr, so Ho is not rejected.
Prpig
Hypothesis:
Ho: b9 =0 tcr = 2.12
H1: b9≠0 t= -0.65
|t| < tcr, so Ho is not rejected.
Prdiesel
Hypothesis:
Ho: b10=0 tcr = 2.12
H1: b10≠0 t= -0.11
|t| < tcr, so Ho is not rejected.
Fert
Hypothesis:
Ho: b11 =0 tcr = 2.12
H1: b11≠0 t= -0.11
|t| < tcr, so Ho is not rejected.
Even though we could not reject the null hypothesis for many variables, this does not concern
us much, because those with the greatest impact on Income turned out to be significant.
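The ten decisions above all follow the same two-sided rule: reject Ho when |t| exceeds the critical value. The short sketch below applies that rule to the t-statistics listed in this section; the variable names are only for illustration.

```python
T_CRITICAL = 2.119905  # two-sided critical value used in the text

# t-statistics as reported for each estimator in this section
t_stats = {
    "CereCrop": 2.40,
    "PrWheat": 0.59,
    "RD": 5.67,
    "PrMilk": -1.16,
    "LVgoat": -5.23,
    "Prgoat": 1.84,
    "LVpig": 1.10,
    "Prpig": -0.65,
    "Prdiesel": -0.11,
    "Fert": -0.11,
}

# An estimator is significant when the absolute t-statistic exceeds the critical value
significant = [name for name, t in t_stats.items() if abs(t) > T_CRITICAL]
print(significant)  # ['CereCrop', 'RD', 'LVgoat']
```

Taking the absolute value matters for LVgoat, whose t-statistic is large but negative.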
Interpretation of Significant variables
In this part we close our research with the final step: we try to recommend a policy
plan based on the conclusions of our model.
We recall that our target variable Income is log-transformed.
CereCrop: it is log-transformed to address skewness. Coefficient = 0.21
As both variables are log-transformed, the coefficient is an elasticity: a 1% increase in crop
production is associated with an increase of about 0.21% in Income in the EU. We do not forget that our Income is
expressed in terms of profit, excluding taxes and costs and including subsidies. This
does not affect our results, because our research focuses on how to increase total wealth
across all the countries, so we are free to express it in terms of profits. It is also consistent
with the specific policy for agricultural products that we referred to earlier.
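In a log-log model the coefficient is an elasticity, so the effect of a 1% change in the predictor is roughly β percent, not β × 100 percent. The sketch below computes the exact percentage change implied by log(Y) = ... + β·log(X) for the three significant coefficients reported in this section (0.21, 0.31 and 0.43); the function name is hypothetical.

```python
def pct_change_in_y(beta, pct_change_in_x=1.0):
    """Exact % change in Y implied by a log-log coefficient beta."""
    return ((1 + pct_change_in_x / 100) ** beta - 1) * 100

# Coefficients as reported in this section
for name, beta in [("CereCrop", 0.21), ("LVgoat", 0.31), ("RD", 0.43)]:
    print(name, round(pct_change_in_y(beta), 3))
```

For small changes the exact value is almost identical to β itself, which is why the coefficient is usually read directly as "a 1% change in X goes with a β% change in Y".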
Proposal Cereal Crops
As this variable is very important to agricultural wages, the EU should try to support this
sector. We realize that pushing up production is not simple,
because we cannot guarantee consumption by customers. Thus the EU
could try to increase consumption with specific policies, such as
supplying these kinds of products to the military or to schools. Alternatively, it could give a
support subsidy per amount of production to those engaged in this kind of activity.
Lvgoat: it is log-transformed to address skewness. Coefficient = 0.31
For the same reasons as before, a 1% increase in the livestock of goats is associated with
an increase of about 0.31% in Income. These animals participate in
two kinds of markets: their milk goes to dairy product factories, and their meat
to supermarkets. As a recommendation, we propose a direct subsidy
to these farms and a restriction on the slaughtering of goats for meat. The subsidy could then also go
to meat sellers, to support them in the face of decreasing demand. Of course this
is speculation, because we do not know the price elasticity of this product.
RD: it is log-transformed to address skewness. Coefficient = 0.43
Finally, a 1% increase in Research & Development is associated with an increase of about
0.43% in Income. What we take from this is that the EU should spend more time considering ways to improve
productivity through development, such as supporting farms with machinery or new kinds of
fertilizers for increased efficiency. Research also gives the flexibility to work on
costs: new technologies or techniques can be discovered that substantially decrease
the producer's costs.
Conclusions
Under the model that we built, we found that our initial opinion, that prices are very
important, does not hold. We understand, of course, that the behaviour of prices probably comes
from the special nature of this sector and some of its inelasticities. Thus the EU should be
production oriented. Above all, we see that Research and Development is the most
important component and deserves strong consideration.
II. Appendix- Model Trials
1. Basic Linear Model
Income =β0 +β1*CereCrop+β2*Prwheat+β3*Rawmilk+β4*Prmilk+
β5*LVgoat+β6*Prgoat+β7*LVpig+β8*Prpig+β9*RD+β10*Prdiesel+β11*Fert
2. Log-linear Model (except Prmilk, Prdiesel and Square Fert)
Income =β0+ β1*logCerecrop+β2*logPrWheat+β3*logRawmilk+β4*Prmilk+
β5*logLVgoat+β6*logPrgoat+β7*logLVpig+β8*logPrpig+β9*logRD+
β10*Prdiesel+β11*sqFert
3. Log-linear Model (except Prmilk, Prdiesel)
Income =β0+β1*logCerecrop+β2*logPrWheat+β3*logRawmilk+β4*Prmilk+β5*logLVgoat+
β6*logPrgoat+β7*logLVpig+β8*logPrpig+β9*logRD+β10*Prdiesel+β11*logFert
4. Log-linear Model (except Prmilk, Prdiesel and Square Prwheat)
Income =β0+β1*logCerecrop +β2*sqPrWheat+β3*logRawmilk+
β4*Prmilk+β5*logLVgoat+ β6*logPrgoat+β7*logLVpig+β8*logPrpig+
β9*logRD+β10*Prdiesel+β11*logFert
This makes the fit worse, so the transformation is rejected.
5. Square Root-linear Model
Income =β0 +β1*sqrCereCrop+β2*sqrPrWheat+β3*sqrRawmilk+β4*sqrPrmilk+
β5*sqrLVgoat+ β6*sqrPrgoat+β7*sqrLVpig+β8*sqrPrpig+β9*sqrRD+
β10*sqrPrdiesel+β11*sqrFert
This model is problematic in terms of some estimators. We can see again Diesel price
positive. But, we should notice the great in terms of fertilizers.
We meet again the problematic residual table for Prwheat, but now also for LVgoat
and PrGoat.
6. Square Root-linear Model (except LVgoat, PrGoat, Prwheat, Prdiesel)
Income =β0+β1*CereCrop +β2*Prwheat+β3*sqRawmilk+β4*Prmilk+β5*LVgoat+
β6*Prgoat+β7*sqLVpig+β8*sqPrpig+β9*sqRD+β10*Prdiesel+β11*sqFert
This model is not a good approach. We can see good variables losing their
explanatory power, and heteroscedasticity increases.
7. Square-linear Model
Income =β0+β1*CereCrop +β2*sqPrWheat+β3*sqRawmilk+β4*sqPrmilk+
β5*sqLVgoat+ β6*sqPrgoat+β7*sqLVpig+β8*sqPrpig+β9*sqRD+
β10*sqPrdiesel+β11*sqFert
In the same way as the previous examples, this model is not appropriate.
8. X^3-linear Model
Income =β0+β1*CereCrop^3+β2*sqPrWheat^3+β3*sqRawmilk^3+β4*sqPrmilk^3+
β5*sqLVgoat^3+ β6*sqPrgoat^3+β7*sqLVpig^3+β8*sqPrpig^3+β9*sqRD^3+
β10*sqPrdiesel^3+β11*sqFert^3
We can see that this model works well, though not best, for the PrMilk variable. Generally, it is
not a good model.
9. logY-linear Model
logIncome =β0+β1*CereCrop +β2*Prwheat+β3*Rawmilk+β4*Prmilk+β5*LVgoat+
β6*Prgoat+β7*LVpig+β8*Prpig+β9*RD+β10*Prdiesel+β11*Fert
10. Y^2-linear Model
Income^2=β0+β1*CereCrop +β2*Prwheat+β3*Rawmilk+β4*Prmilk+β5*LVgoat+
β6*Prgoat+β7*LVpig+β8*Prpig+β9*RD+β10*Prdiesel+β11*Fert
It is a little bit worse than the previous one.
In the next section we test different combinations of transformations for
each variable.
Mix Models
The models examined are the following:
Transformations
Variable          Model 11   Model 12   Model 13   Model 14   Model 15
CereCrop          Log        Sqrt       Log        Sqrt       Sqrt
Prwheat           Log        Log        Log        Log        Log
Rawmilk           Log        Sqrt       Log        Sqrt       Sqrt
Prmilk            None       Log        X^3        Log        Log
LVgoat            Log        Log        Log        Log        Log
LVpig             Log        Log        Log        Log        Log
Prgoat            Log        Log        Log        Log        Log
Prpig             Log        Log        Log        Log        Log
RD                Log        Sqrt       Sqrt       Sqrt       Sqrt
Prdiesel          Log        Sqrt       Log        Sqrt       Sqrt
Fert              Log        Log        Log        Log        Log
Income (target)   None       None       Log        Log        Sqrt
11. Mix 11-linear Model
Income=β0+β1*logCereCrop +β2*logPrWheat+β3*logRawmilk+β4*Prmilk^3+
β5*logLVgoat+ β6*logPrgoat+β7*logLVpig+β8*logPrpig+β9*sqrRD+
β10*logPrdiesel+β11*logFert
12. Mix 12-linear Model
Income=β0+β1*sqrCereCrop +β2*logPrWheat+β3*sqrRawmilk+β4*Prmilk^3+
β5*logLVgoat+ β6*logPrgoat+β7*logLVpig+β8*logPrpig+β9*sqrRD+
β10*sqrPrdiesel+β11*logFert
13. Mix 13-linear Model
logIncome=β0+β1*logCereCrop +β2*logPrWheat+β3*logRawmilk+β4*Prmilk^3+
β5*logLVgoat+ β6*logPrgoat+β7*logLVpig+β8*logPrpig+β9*sqrRD+
β10*logPrdiesel+β11*logFert
14. Mix 14-linear Model
logIncome=β0+β1*sqrCereCrop +β2*logPrWheat+β3*sqrRawmilk+β4*Prmilk^3+
β5*logLVgoat+ β6*logPrgoat+β7*logLVpig+β8*logPrpig+β9*sqrRD+
β10*sqrPrdiesel+β11*logFert
15. Mix 15-linear Model
sqrIncome=β0+β1*sqrCereCrop +β2*logPrWheat+β3*sqrRawmilk+β4*Prmilk^3+
β5*logLVgoat+ β6*logPrgoat+β7*logLVpig+β8*logPrpig+β9*sqrRD+
β10*sqrPrdiesel+β11*logFert
4. References sites that we trusted (opinions and theory)
Tools
Sas Miner
Sas 9.3
Xlstat
Data collection
http://www.worldbank.org
http://ec.europa.eu/eurostat
http://www.gapminder.org/data/
Assumptions of OLS
https://www.ecu.edu/cs-dhs/bios/upload/SAS_Regression-2012.pdf
http://www.ats.ucla.edu/stat/sas/webbooks/reg/chapter2/sasreg2.htm
https://www.uvm.edu/~wgibson/Classes/200f09/Technical_notes/Hausman.pdf
http://nationalekonomi.hannes.se/regression-analysis/assumptions
http://www.statisticssolutions.com/homoscedasticity/
https://www3.nd.edu/~rwilliam/stats2/l25.pdf
http://www.lexjansen.com/wuss/2006/posters/POS-Ayyangar.pdf
https://onlinecourses.science.psu.edu/stat501/node/347
http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/#x-unbalanced-header
https://en.wikipedia.org/wiki/Linearity
https://www.ine.pt/revstat/pdf/rs160105.pdf
http://stats.stackexchange.com/questions/55888/zero-conditional-mean-assumption
http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/
https://jrvargas.files.wordpress.com/2011/01/wooldridge_j-_2002_econometric_analysis_of_cross_section_and_panel_data.pdf
https://www3.nd.edu/~rwilliam/stats1/x92.pdf
http://www.statistics4u.info/fundstat_eng/ee_shapiro_wilk_test.html
http://www.stat.purdue.edu/~tqin/system101/method/QQplot_sas.htm
http://blog.minitab.com/blog/adventures-in-statistics/what-are-the-effects-of-multicollinearity-and-when-can-i-ignore-them
Interpretation
http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients
http://blogs.sas.com/content/iml/2013/06/12/interpret-residual-fit-spread-plot.html
http://blog.minitab.com/blog/adventures-in-statistics/what-is-the-f-test-of-overall-significance-in-regression-analysis
http://www.geosci-model-dev.net/7/1247/2014/gmd-7-1247-2014.pdf
http://muscle.ucsd.edu/More_HTML/papers/pdf/Lieber_JOR_1990.pdf
http://www.reed.edu/economics/course_pages/red_spots/testing_hypotheses.htm
Agriculture Policy
http://ec.europa.eu/agriculture/statistics/agricultural/2013/pdf/c5-1-351_en.pdf
http://www.gapminder.org/data/
http://ec.europa.eu/agriculture/50-years-of-cap/files/history/history_book_lr_en.pdf