COMPARING THE EFFICIENCY OF REGRESSION AND RATIO ESTIMATION - 2148137
EFFICIENCY OF REGRESSION AND RATIO ESTIMATION
~ KRITI MAHNOT
CHRIST (DEEMED TO BE) UNIVERSITY, BENGALURU
This blog compares the efficiency of Regression and Ratio Estimation. We analyze which estimation technique is better with the help of R programming.
In many surveys, information on an auxiliary variable that is highly correlated with the variable of interest is readily available and can be used to improve the sampling design.
However, at times such information is not available for individual units. In such instances, aggregate values of the auxiliary variate can still be used to estimate the parameters. The methods used in such estimation are known as the "Ratio Method of Estimation" and the "Regression Method of Estimation".
RATIO ESTIMATION :
Ratio estimation is a technique that uses available auxiliary information which is correlated with the variable of interest. The ratio estimator is a statistical parameter defined as the ratio of the means of two random variables.
Ratio estimates are biased, and corrections must be made when they are used in experimental or survey work. Their distribution is asymmetrical, so symmetrical tests such as the t test should not be used to generate confidence intervals. This technique of estimating parameters is used in Simple Random Sampling (SRS) as well as in Stratified Sampling.
Under Stratified Sampling, the two methods of Ratio Estimation are the Separate Ratio Estimator and the Combined Ratio Estimator.
If the stratum sample sizes are large (more than 20), it is better to use separate ratio estimators. Otherwise, if the sample sizes are small or the within-stratum ratios are approximately equal, it is better to use combined ratio estimators.
The notations used in Ratio Estimation are as follows:
yi- the value of the characteristic under study for the i-th unit of the population
xi- value of the auxiliary characteristic on the same unit
Y- the total of y characteristic of the population
X- the total of x characteristic of the population
x̄- sample mean of x character
ȳ- sample mean of y character
R=Y/X=Ȳ/X̅= Ratio of the population totals or means of y and x
ρ- the correlation coefficient between x and y in the population
The ratio estimators of the population ratio Y/X=R, the total Y and the mean Ȳ may be defined as:
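In the notation listed above, the standard textbook forms of these estimators under simple random sampling can be sketched as follows. The hat notation and the subscript R are assumptions made here for readability.

```latex
\hat{R} = \frac{\bar{y}}{\bar{x}}, \qquad
\hat{Y}_R = \frac{\bar{y}}{\bar{x}}\,X = \hat{R}\,X, \qquad
\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}}\,\bar{X} = \hat{R}\,\bar{X}
```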
In some samples, the distribution of R̂ (Rcap) is skewed and R̂ is a slightly biased estimator of R. For large samples, however, the distribution of R̂ tends to the normal distribution and the bias becomes negligible.
FORMULAE FOR RATIO ESTIMATION:
1. The sample ratio (r) is estimated from the sample:
2. ESTIMATE TOTAL:
The estimated total of the y variate ( τy ) is
where ( τx ) is the total of the x variate.
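Written out, steps 1 and 2 usually take the form below, where the sums run over the n sampled units. The hat on τy marks it as an estimate and is notation assumed here.

```latex
r = \frac{\bar{y}}{\bar{x}} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i},
\qquad
\hat{\tau}_y = r\,\tau_x
```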
3. VARIANCE ESTIMATES:
a. The Variance of the Sample Ratio is approximately:
- straight line: yi = β0 + β1·xi + εi
- In multiple linear regression, there are several independent variables or functions of independent variables.
Adding a term in xi² to the preceding regression gives:
- parabola: yi = β0 + β1·xi + β2·xi² + εi
- Suppose xi and yi are obtained for each unit in the sample; then a least-squares estimate of β is given by: b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
Although the approximate variance estimator of the ratio given below is biased, if the sample size is large, the bias in this estimator is negligible.
where N is the population size, n is the sample size and mx is the mean of the x variate.
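The approximate variance referred to in item a is commonly written as below. This is a sketch in the symbols just defined; whether mx is taken as the population mean or the sample mean of x depends on what is known, which is an assumption left open here.

```latex
\widehat{\operatorname{var}}(r) \approx
\frac{N-n}{N}\cdot\frac{1}{n\,m_x^{2}}\cdot
\frac{\sum_{i=1}^{n}\left(y_i - r\,x_i\right)^{2}}{n-1}
```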
b. The Variance of the estimated Total is:
c. The Variance of the estimated Mean of the y variate is:
where mx is the mean of the x variate, sx2 and sy2 are the sample variances of the x and y variates respectively and ρ is the sample correlation between the x and y variates.
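Items b and c can be sketched in the same symbols as follows, with r the sample ratio and ρ the sample correlation. The second expression is the one evaluated as V1 in the R code later in this post, where R stands for the sample ratio and r for the correlation.

```latex
\begin{aligned}
\widehat{\operatorname{var}}(\hat{\tau}_y) &= \tau_x^{2}\,\widehat{\operatorname{var}}(r) \\
\widehat{\operatorname{var}}(\bar{y}_R) &= \frac{N-n}{N\,n}
  \left(s_y^{2} + r^{2}s_x^{2} - 2\,r\,\rho\,s_x s_y\right)
\end{aligned}
```

The second line follows from the first-order expansion because, with r = ȳ/x̄, each residual yi − r·xi equals (yi − ȳ) − r(xi − x̄).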
d. In Simple Random Sampling Without Replacement (SRSWOR), for large n, an approximation to the variance of R̂ is given by:
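The usual large-sample form of this approximation is sketched below. Here f = n/N is the sampling fraction and the sum runs over all N population units; both pieces of notation are assumptions added for this sketch.

```latex
V(\hat{R}) \approx \frac{1-f}{n\,\bar{X}^{2}}\cdot
\frac{\sum_{i=1}^{N}\left(y_i - R\,x_i\right)^{2}}{N-1},
\qquad f = \frac{n}{N}
```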
e. Confidence Limits:
For large samples, the estimate of the mean or total may be assumed to follow an approximately normal distribution. The confidence interval (CI) for the total can be written as:
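Under that normal approximation, the interval takes the familiar form below, where z is the standard normal quantile (about 1.96 for a 95% interval). The notation is assumed here and follows the estimated total and its variance from items 2 and 3b.

```latex
\hat{\tau}_y \;\pm\; z_{\alpha/2}\,\sqrt{\widehat{\operatorname{var}}(\hat{\tau}_y)}
```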
When the population is stratified and units are drawn by simple random sampling from each stratum, there are two ways of obtaining a ratio estimate of the population total Y: the Separate Ratio Estimate and the Combined Ratio Estimate.
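The difference between the two can be illustrated with a minimal R sketch. The strata, their sizes, the known totals of x and the sample values below are all made up for illustration and are not part of this post's data.

```r
# Hypothetical stratified sample: two strata with known population sizes (N)
# and known population totals of the auxiliary variable (X_total).
strata <- list(
  h1 = list(N = 40, X_total = 420,
            x = c(8, 10, 12, 9, 11), y = c(17, 19, 25, 18, 21)),
  h2 = list(N = 60, X_total = 1740,
            x = c(28, 30, 33, 29), y = c(45, 44, 52, 43))
)

# Separate ratio estimate: apply a ratio within each stratum, then add up.
separate_total <- sum(sapply(strata, function(s) {
  (mean(s$y) / mean(s$x)) * s$X_total
}))

# Combined ratio estimate: form stratified estimates of the totals of y and x
# first, then apply a single ratio to the overall known total of x.
y_st_total <- sum(sapply(strata, function(s) s$N * mean(s$y)))
x_st_total <- sum(sapply(strata, function(s) s$N * mean(s$x)))
X_total    <- sum(sapply(strata, function(s) s$X_total))
combined_total <- (y_st_total / x_st_total) * X_total

separate_total
combined_total
```

The separate estimate uses one ratio per stratum, so it needs the total of x in every stratum; the combined estimate needs only the overall total of x.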
Applications of Ratio Estimation include:
1. Situations where the x and y variates are highly correlated and their relationship passes through the origin.
2. Survey methodology, when estimating a weighted average in which the denominator is the sum of weights that reflect the total population size, but the total population size is unknown.
REGRESSION ESTIMATION:
Like ratio estimators, linear regression estimators also make use of auxiliary information to increase precision. The ratio estimator provides a precise estimate of the population mean if the regression is linear and the line passes through the origin. When the regression is linear but the line does not pass through the origin, we use estimators based on linear regression.
If the study variate (y) is approximately a constant plus a multiple of the auxiliary variate, it is more precise to estimate the population mean or total by fitting a linear regression. Such an estimator is called a "Regression Estimator".
Thus, Regression Analysis is a statistical tool for investigating the relationship between a dependent or response variable (y) and an independent or regressor variable (x).
The relationship between the two variables can be displayed in a diagram called a "Scatter Plot".
The parameters in this analysis are estimated using the "Method of Least Squares" or "Ordinary Least Squares". The model involves the following components:
1. The unknown parameters, often denoted by a scalar or vector β.
2. The independent variables, which are observed in the data and denoted by a vector xi.
3. The dependent variable, which is observed in the data and denoted by the scalar yi.
4. The error terms, which are not directly observed in the data and are often denoted by the scalar εi. The residual, ei, is the difference between the value of the dependent variable predicted by the model, ŷi, and the true value of the dependent variable, yi.
In linear regression, the model specification is that the dependent variable, yi, is a linear combination of the parameters. In simple linear regression for modeling n data points there is one independent variable, xi, and two parameters, β0 and β1, giving the straight-line model yi = β0 + β1·xi + εi.
Thus the linear regression estimators of the population mean Ȳ and the population total Y are given by:
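In the notation of the ratio section, these estimators are usually written as below. The mean estimator matches the expression ybar + b*(Xbar - xbar) used in the R code later in this post; the subscript lr is notation assumed here.

```latex
\bar{y}_{lr} = \bar{y} + b\,(\bar{X} - \bar{x}),
\qquad
\hat{Y}_{lr} = N\,\bar{y}_{lr}
```

Here b is the sample estimate of the regression coefficient β and N is the population size.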
The regression estimator is, in general, biased for two reasons: 1. β is generally estimated by taking the ratio of the estimate of Cov(ȳ, x̄) to that of V(x̄), and 2. it involves the product of two estimates, viz. b·x̄. The bias of the regression estimator will usually be trivial and will decrease as the sample size increases.
Apart from the least-squares method, some other methods of finding the estimate include:
1. Bayesian Linear Regression
2. Least Absolute Deviations
3. Nonparametric Regression
4. Scenario Optimization
Applications of Regression Analysis include:
1. Prediction and forecasting, the use of which overlaps with that of machine learning
2. Inferring a "causal relationship" between dependent and independent variables.
COMPARING THE EFFICIENCY OF RATIO AND REGRESSION ESTIMATION:
Ratio and Regression estimators are compared on the basis of bias and coefficient of variation.
Theoretically, on comparing the two on this basis, we see that Ratio estimation is better than Regression estimation when the regression line passes close to the origin. Ratio and regression estimators still work even if there is only a weak linear relationship between x and y.
To verify the results, we take a practical example and analyze it using R programming.
In the example, Mr. John selects 21 states from a population of 50 states of a country and collects information about real estate farm loans and non-real estate farm loans.
We compare the results obtained using Ratio Estimation and Regression Estimation and suggest the more efficient method of estimation.
Here, x is the non-real estate farm loans (auxiliary variable) and y is the real estate farm loans (variable of interest).
The first step is to load the required packages and import the data.
#Loading the required packages (install them first with install.packages() if needed)
library(SDaA)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
##     dotchart
library(readxl)
dataset=read_excel("C:/Users/Kirti/Desktop/dataset.xlsx")
head(dataset)
## # A tibble: 6 x 2
##         x       y
##     <dbl>   <dbl>
## 1  348.    409.
## 2  431.     54.6
## 3  848.    908.
## 4 3929.   1343.
## 5  906.    316.
## 6    4.37    7.13
colnames(dataset)
## [1] "x" "y"
nrow(dataset)
## [1] 21
To check the relationship between x and y and the value of their correlation, we use the following code.
#Checking the correlation using a scatter plot
plot(dataset$x, dataset$y)
#Value of the correlation
cor(dataset$x, dataset$y)
## [1] 0.664419
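The correlation alone does not show whether the fitted line passes through the origin, which is the condition under which the ratio estimator does well. A minimal sketch of one way to check this is given below; it uses simulated data rather than the loan data, and the variable names and simulated values are assumptions for illustration only.

```r
# Simulated data for illustration only (not the farm loan data)
set.seed(1)
x <- runif(30, min = 10, max = 100)
y <- 5 + 0.4 * x + rnorm(30, sd = 3)

# Ordinary least-squares fit with an intercept
fit <- lm(y ~ x)
coef(fit)                              # intercept and slope

# 95% confidence interval for the intercept: if it covers 0,
# a line through the origin is plausible
intercept_ci <- confint(fit)["(Intercept)", ]
intercept_ci
```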
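Now, we take the auxiliary variable, i.e., 'x', and use the lm() function to fit a regression model, which gives us the regression coefficient, 'b'. The model fitted here has no intercept (y ~ 0 + x) and weights each observation by its x value; the weights are assigned because we use the auxiliary variable (x) to estimate y. Note that the textbook regression estimator takes b from an ordinary least-squares fit with an intercept, b = sxy/sx2, which is also what the variance formula used further below assumes.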
regx=dataset$x
reg_model=lm(y~0+x,weights=regx,data=dataset) ; reg_model
##
## Call:
## lm(formula = y ~ 0 + x, data = dataset, weights = regx)
##
## Coefficients:
##      x
## 0.3982
summary(reg_model)
##
## Call:
## lm(formula = y ~ 0 + x, data = dataset, weights = regx)
##
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max
## -13848.0    -83.8    658.4   9843.9  28282.0
##
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x  0.39819    0.03368   11.82 1.76e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12220 on 20 degrees of freedom
## Multiple R-squared:  0.8749, Adjusted R-squared:  0.8686
## F-statistic: 139.8 on 1 and 20 DF,  p-value: 1.764e-10
#estimated b is 0.3982
b=0.3982
#regression estimator is ybar_reg=ybar+b*(Xbar-xbar)
ybar=mean(dataset$y) ; ybar
## [1] 602.5512
xbar=mean(dataset$x) ; xbar
## [1] 899.3581
Xbar=878.16 ; Xbar #known
## [1] 878.16
ybar_reg=ybar+b*(Xbar-xbar)
ybar_reg #estimated population mean using regression estimator
## [1] 594.1101
Now, the variance using the regression estimator is calculated.
#Standard error: first find the estimate of the variance of ybar_reg, then take
#its square root. For this, we need the correlation coefficient (r), N, n and
#the sample mean square of y (sy2)
r=cor(dataset$y,dataset$x)
N=50
n=21
sy2=var(dataset$y) #var is the estimated value of S^2 for variable y
V=((N-n)/(N*n))*(sy2-(r^2)*sy2) ; V
## [1] 4645.556
SE=sqrt(V) ; SE
## [1] 68.15831
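The steps above can be collected into one small function. The sketch below is illustrative only: the function name, its arguments and the simulated sample are assumptions, and it takes b as the ordinary least-squares slope cov(x, y)/var(x) from a model with an intercept, which is the textbook choice rather than the weighted through-origin fit used above.

```r
# Regression estimate of the population mean and its approximate standard
# error under SRSWOR. Function and argument names are illustrative.
regression_estimate <- function(x, y, Xbar, N) {
  n <- length(y)
  b <- cov(x, y) / var(x)                        # least-squares slope (with intercept)
  ybar_reg <- mean(y) + b * (Xbar - mean(x))     # ybar + b * (Xbar - xbar)
  r <- cor(x, y)
  V <- ((N - n) / (N * n)) * var(y) * (1 - r^2)  # same formula as V above
  list(b = b, estimate = ybar_reg, variance = V, se = sqrt(V))
}

# Simulated sample for demonstration only (not the farm loan data)
set.seed(42)
x <- runif(21, min = 0, max = 1000)
y <- 100 + 0.5 * x + rnorm(21, sd = 50)
res <- regression_estimate(x, y, Xbar = 500, N = 50)
res
```

Computing b inside the function, instead of typing in a rounded value, keeps the estimate and its variance tied to the same fitted slope.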
Now, we find the variance under ratio estimation and compare the two estimations (ratio and regression) to find the more efficient method.
#Estimation of the variance by ratio estimation
#Survey sample analysis and ratio estimation
re=svydesign(ids=~1,weights=~1,data=dataset)
#Ratio estimation
svyratio(numerator=~dataset$y,denominator=dataset$x,design=re)
## Ratio estimator: svyratio.survey.design2(numerator = ~dataset$y, denominator = dataset$x,
##     design = re)
## Ratios=
##
## dataset$y 0.6699792
## SEs=
##                [,1]
## dataset$y 0.1380009
1. The estimate of the Ratio is 0.6699792.
2. The Standard Error of the estimate of the Ratio is 0.1380009, i.e., from sample to sample the estimated ratio would typically deviate by about 0.1380009.
#Estimating the variance of the estimate
sx2=var(dataset$x) #variance is the estimated value of S^2 for the variable X
sx=sqrt(sx2)
sy=sqrt(sy2)
r=0.664419
R=0.6699792
V1=(((N-n)/(n*N))*(sy2+(R^2)*sx2-(2*R*r*sx*sy)))
V1
## [1] 8934.233
INTERPRETATION: We can clearly see that the variance of the estimate of Ybar by Ratio estimation (V1=8934.233) is greater than the variance of the estimate of Ybar computed by Regression estimation (V=4645.556).
The estimator with the smaller variance is the more efficient one; hence, for this data, the Regression method of estimation is more efficient than the Ratio method of estimation.
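The size of the gain can be expressed as a relative efficiency, i.e., the ratio of the two variances reported above. The short sketch below only re-uses those two printed values; the name rel_eff is an assumption.

```r
V  <- 4645.556   # variance under regression estimation (from above)
V1 <- 8934.233   # variance under ratio estimation (from above)

# Relative efficiency of regression estimation with respect to ratio estimation
rel_eff <- V1 / V
round(rel_eff, 2)
## [1] 1.92
```

In other words, for this sample the ratio estimator's variance is a little under twice that of the regression estimator.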
CONCLUSION:
In general, we can conclude that if a linear relationship exists between the x and y variates and the regression of y on x passes through the origin, then the estimated variance of the regression estimator is often less than that of the ratio estimator. The precise relationship between the variances depends on how linear the relation between the x and y variates is. When the relation is other than linear, the ratio estimate could have a lower variance than that estimated by regression.
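The ordering seen in the example agrees with the usual large-sample variance formulas, sketched below for simple random sampling without replacement. The symbols S_x, S_y (population standard deviations) and f = n/N are notation assumed here, and b is taken to be the least-squares slope.

```latex
\begin{aligned}
V(\bar{y}_{lr}) &\approx \frac{1-f}{n}\,S_y^{2}\,(1-\rho^{2}) \\
V(\bar{y}_{R})  &\approx \frac{1-f}{n}\left(S_y^{2} + R^{2}S_x^{2} - 2R\rho S_x S_y\right) \\
V(\bar{y}_{R}) - V(\bar{y}_{lr}) &\approx \frac{1-f}{n}\left(R\,S_x - \rho\,S_y\right)^{2} \;\ge\; 0
\end{aligned}
```

The difference vanishes when R = ρ·S_y/S_x, that is, when the regression line passes through the origin. These are first-order approximations, so in small samples the actual variances need not follow this ordering.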