COMPARING THE EFFICIENCY OF REGRESSION AND RATIO ESTIMATION


~ KRITI MAHNOT

CHRIST (DEEMED TO BE) UNIVERSITY, BENGALURU

This blog compares the efficiency of Ratio Estimation and Regression Estimation and uses R programming to examine which estimation technique performs better on a worked example.

In many surveys, information on an auxiliary variable that is highly correlated with the variable of interest is readily available and can be used to improve the sampling design.

However, at times such information is not available for individual units. In such instances, aggregate information on the auxiliary variate can still be used to estimate the parameters. The methods used in such estimation are known as the "Ratio Method of Estimation" and the "Regression Method of Estimation".

RATIO ESTIMATION :

Ratio estimation is a technique that uses available auxiliary information which is correlated with the variable of interest. The quantity being estimated is the ratio of the means (or totals) of two variables.

Ratio estimates are biased, and corrections must be made when they are used in experimental or survey work. Their sampling distribution is asymmetrical, so symmetrical procedures such as the t test should not be used to generate confidence intervals. This technique of estimating parameters is used in Simple Random Sampling (SRS) as well as in Stratified Sampling.

Under Stratified Sampling, there are two methods of Ratio Estimation: the Separate Ratio Estimator and the Combined Ratio Estimator.

If the stratum sample sizes are large (more than 20), it is better to use separate ratio estimators. Otherwise, if the sample sizes are small or the within-stratum ratios are approximately equal, it is better to use combined ratio estimators.

The notations used in Ratio Estimation are as follows:

yi- the value of the characteristic under study for the i-th unit of the population

xi- value of the auxiliary characteristic on the same unit

Y- the total of y characteristic of the population

X- the total of x characteristic of the population

x̄- sample mean of x character

ȳ- sample mean of y character

R=Y/X=Ȳ/X̅= Ratio of the population totals or means of y and x

ρ- the correlation coefficient between x and y in the population

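The ratio estimators of the population ratio R=Y/X, the total Y and the mean Ȳ may be defined as:

{\hat {R}}={\frac {\bar {y}}{\bar {x}}},\qquad {\hat {Y}}_{R}={\hat {R}}X,\qquad {\hat {\bar {Y}}}_{R}={\hat {R}}{\bar {X}}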

In small samples, the distribution of R̂ (the sample estimate of R) is skewed and R̂ is a slightly biased estimator of R. For large samples, however, the distribution of R̂ tends to the normal distribution and the bias becomes negligible.
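The small-sample bias can be checked exactly on a toy population by enumerating every possible sample. The sketch below is illustrative only: the eight population values and the helper name ratio_bias are invented for this example, and the sampling scheme is assumed to be simple random sampling without replacement.

```r
# Hypothetical population of N = 8 units (values invented for illustration).
pop_x <- c(2, 4, 5, 7, 9, 12, 15, 20)
pop_y <- c(3, 5, 9, 10, 16, 18, 27, 31)
R_true <- sum(pop_y) / sum(pop_x)

# Exact bias of r = ybar / xbar: average r over ALL samples of size n
# (every SRSWOR sample is equally likely) and subtract the true ratio.
ratio_bias <- function(n) {
  r_all <- combn(length(pop_x), n,
                 FUN = function(idx) sum(pop_y[idx]) / sum(pop_x[idx]))
  mean(r_all) - R_true
}

bias_by_n <- sapply(c(2, 4, 6, 8), ratio_bias)
names(bias_by_n) <- c("n=2", "n=4", "n=6", "n=8")
print(bias_by_n)
```

When n equals N there is only one possible sample, so the bias is exactly zero; the other entries show how far the average of r sits from R for smaller samples of this particular population.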

FORMULAE FOR RATIO ESTIMATION:

1. The sample ratio (r) is estimated from the sample-

r={\frac {\bar {y}}{\bar {x}}}={\frac {\sum _{i=1}^{n}y_{i}}{\sum _{i=1}^{n}x_{i}}}

2. ESTIMATE TOTAL:

The estimated total of the y variate ( τ_y ) is

\tau _{y}=r\tau _{x}

where ( τ_x ) is the total of the x variate.
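As a quick illustration of steps 1 and 2, the following sketch computes r and the estimated total in R. The five sample values and the known total tau_x are hypothetical numbers chosen so the arithmetic is easy to follow.

```r
# Hypothetical sample (values invented for illustration).
x <- c(10, 12, 15, 20, 23)   # auxiliary variate
y <- c(21, 25, 29, 41, 44)   # study variate
tau_x <- 400                 # assumed known population total of x

r <- sum(y) / sum(x)         # sample ratio; identical to mean(y) / mean(x)
tau_y_hat <- r * tau_x       # ratio estimate of the total of y

r          # 160 / 80 = 2
tau_y_hat  # 2 * 400 = 800
```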

3. VARIANCE ESTIMATES:

a. The Variance of the Sample Ratio is approximately:

\operatorname {var}(r)={\frac {1}{s_{x}^{2}+m_{x}^{2}}}\left[(s_{y}^{2}-s_{{x^{2}[y^{2}/x^{2}]}})-(s_{{x[y/x]}})^{2}+2m_{y}s_{{x[y/x]}}-{\frac {s_{x}^{2}}{m_{x}^{2}}}(m_{y}-s_{{x[y/x]}}^{2})\right]

where s_x² and s_y² are the variances of the x and y variates respectively, m_x and m_y are the means of the x and y variates respectively and s_ab is the covariance of a and b.

For a sample of size n drawn from a population of size N, a simpler large-sample estimator is:

\operatorname {var} (r)={\frac {N-n}{N}}{\frac {1}{nm_{x}^{2}}}{\frac {\sum _{i=1}^{n}(y_{i}-rx_{i})^{2}}{n-1}}

b. The Variance of the estimated Total is:

\operatorname {var} (\tau _{y})=\tau _{x}^{2}\operatorname {var} (r)

c. The Variance of the estimated Mean of the y variate is:

\operatorname {var} ({\bar {y}})=m_{x}^{2}\operatorname {var} (r)={\frac {N-n}{N}}{\frac {\sum _{i=1}^{n}(y_{i}-rx_{i})^{2}}{n(n-1)}}={\frac {N-n}{N}}{\frac {(s_{y}^{2}+r^{2}s_{x}^{2}-2r\rho s_{x}s_{y})}{n}}

where m_x is the mean of the x variate, s_x² and s_y² are the sample variances of the x and y variates respectively and ρ is the sample correlation between the x and y variates.
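The variance estimates can be collected in one small helper. This is a sketch under stated assumptions rather than a reference implementation: the function name ratio_variances and the sample values are hypothetical, the finite population correction is (N-n)/N with a 1/n factor in var(r), and the variance of the total is taken as tau_x² var(r) because tau_x is a known constant.

```r
# Variance estimates for the ratio, the total and the mean (SRSWOR assumed).
ratio_variances <- function(x, y, N, tau_x) {
  n <- length(x)
  r <- sum(y) / sum(x)
  m_x <- mean(x)
  fpc <- (N - n) / N
  resid_ms <- sum((y - r * x)^2) / (n - 1)  # mean square of y_i - r * x_i
  var_r <- fpc * resid_ms / (n * m_x^2)
  list(r = r,
       var_r = var_r,
       var_total = tau_x^2 * var_r,  # tau_y = r * tau_x with tau_x known
       var_mean = m_x^2 * var_r)
}

# Hypothetical sample (values invented for illustration).
x <- c(10, 12, 15, 20, 23)
y <- c(21, 25, 29, 41, 44)
out <- ratio_variances(x, y, N = 50, tau_x = 400)
out$var_r     # 0.9 * 2 / (5 * 16^2) = 0.00140625
out$var_mean  # 16^2 * 0.00140625 = 0.36
```

The last expression in formula (c) gives the same value, since the residual mean square equals s_y² + r² s_x² - 2 r s_xy when r is the ratio of the sample means.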

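d. In Simple Random Sampling Without Replacement (SRSWOR), for large n, an approximation to the variance of R̂ is given by:

V({\hat {R}})\approx {\frac {1-f}{n{\bar {X}}^{2}}}{\frac {\sum _{i=1}^{N}(y_{i}-Rx_{i})^{2}}{N-1}}

where f=n/N is the sampling fraction and X̄ is the population mean of x.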

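e. Confidence Limits:

For large samples, the estimate of the mean or total may be assumed to follow an approximately normal distribution. Writing Ŷ_R for the ratio estimate of the total and v(·) for an estimated variance, the confidence limits for the total may be written as:

{\hat {Y}}_{R}\pm z{\sqrt {v({\hat {Y}}_{R})}}

where z is the value of the standard normal variate for the chosen confidence coefficient.

Similarly, the limits for R may be written as:

{\hat {R}}\pm z{\sqrt {v({\hat {R}})}}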
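In R, normal-theory limits of the form estimate ± z × standard error can be sketched as follows. The data, N and tau_x are hypothetical, and the sample is kept tiny only so the numbers are easy to trace; the normal approximation itself is meant for large samples.

```r
# Hypothetical sample and population quantities (invented for illustration).
x <- c(10, 12, 15, 20, 23)
y <- c(21, 25, 29, 41, 44)
N <- 50
tau_x <- 400
conf_level <- 0.95

n <- length(x)
r <- sum(y) / sum(x)
var_r <- ((N - n) / N) * sum((y - r * x)^2) / (n - 1) / (n * mean(x)^2)

z <- qnorm(1 - (1 - conf_level) / 2)         # about 1.96 for 95% confidence
ci_ratio <- r + c(-1, 1) * z * sqrt(var_r)   # limits for R
ci_total <- tau_x * ci_ratio                 # limits for the total of y

ci_ratio
ci_total
```

Multiplying the limits for R by tau_x is equivalent to using tau_x² var(r) as the variance of the estimated total.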

When the population is stratified and units are drawn by simple random sampling method from each stratum, there are two ways of obtaining a ratio estimate of the population total Y- SEPARATE Ratio Estimate, and COMBINED Ratio Estimate.
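The difference between the two stratified estimators is easiest to see side by side. In the sketch below, the two strata, their sizes N_h and their known x totals X_h are all hypothetical; the separate estimator applies one ratio per stratum, while the combined estimator applies a single ratio built from the stratified estimates of the y and x totals.

```r
# Hypothetical stratified sample with two strata (all values invented).
strata <- data.frame(
  stratum = c("A", "A", "A", "B", "B", "B"),
  x = c(10, 20, 30, 5, 10, 15),
  y = c(22, 38, 60, 4, 11, 15)
)
N_h <- c(A = 40, B = 60)     # assumed stratum sizes
X_h <- c(A = 900, B = 540)   # assumed known stratum totals of x

xbar_h <- tapply(strata$x, strata$stratum, mean)
ybar_h <- tapply(strata$y, strata$stratum, mean)
h <- names(xbar_h)           # keep strata in a consistent order

# Separate ratio estimate: stratum ratio times stratum X total, then summed.
Y_separate <- sum((ybar_h / xbar_h) * X_h[h])

# Combined ratio estimate: one overall ratio times the overall X total.
Y_combined <- sum(N_h[h] * ybar_h) / sum(N_h[h] * xbar_h) * sum(X_h)

Y_separate  # 2 * 900 + 1 * 540 = 2340
Y_combined  # (2200 / 1400) * 1440, roughly 2262.86
```

The two estimates differ here because the within-stratum ratios (2 and 1) are far apart; when those ratios are nearly equal the two estimators give similar answers.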

Ratio Estimation is typically applied in the following situations:

1. When the x and y variates are highly correlated and the relationship between them passes through the origin.

2. In survey methodology, when estimating a weighted average in which the denominator is the sum of the weights, which reflects the total population size, but the total population size is unknown.

REGRESSION ESTIMATION:

Like ratio estimators, linear regression estimators also make use of auxiliary information for increasing precision. The ratio estimator provides a precise estimate of the population mean if regression is linear and the line passes through the origin. When the regression is linear but the line does not pass through the origin, we use estimators based on linear regression.

If the study variate (y) is approximately a constant plus a multiple of the auxiliary variate, it is more precise to estimate the population mean or total by fitting a linear regression. Such an estimator is called a "Regression Estimator".

Thus, Regression Analysis is a statistical tool for investigating the relationship between a dependent or response variable (y) and an independent or regressor variable (x).

The relationship between the two variables can be displayed graphically with a "Scatter Plot".

The parameters in this analysis are estimated using the "Method of Least Squares" or "Ordinary Least Squares". The model involves the following components:

1. The unknown parameters, often denoted as a scalar or vector $\beta$ .

2. The independent variables (observed in data) and denoted as a vector $X_{i}$ .

3. The dependent variable (observed in data) and denoted using the scalar $Y_{i}$ .

4. The error terms, which are not directly observed in data and are often denoted using the scalar $e_{i}$ . The residual, $e_{i}=y_{i}-{\widehat {y}}_{i}$ , is the difference between the observed value of the dependent variable, $y_{i}$ , and the value predicted by the model, ${\widehat {y}}_{i}$ .

In linear regression, the model specification is that the dependent variable, $y_{i}$ is a linear combination of the parameters. In simple linear regression for modeling $n$ data points there is one independent variable: $x_{i}$ , and two parameters, $\beta _{0}$ and $\beta _{1}$ :

straight line:

y_{i}=\beta _{0}+\beta _{1}x_{i}+\varepsilon _{i},\quad i=1,\dots ,n.

In multiple linear regression, there are several independent variables or functions of independent variables.

Adding a term in $x_{i}^{2}$ to the preceding regression gives:

parabola:

y_{i}=\beta _{0}+\beta _{1}x_{i}+\beta _{2}x_{i}^{2}+\varepsilon _{i},\quad i=1,\dots ,n.
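Both models can be fitted by least squares with lm(). The sketch below uses hypothetical data generated without noise, purely so that the recovered coefficients can be checked by eye; real data would of course leave non-zero residuals.

```r
# Hypothetical noise-free data (invented so the fits are easy to verify).
x <- 1:10
y_line  <- 2 + 3 * x                # straight line: beta0 = 2, beta1 = 3
y_parab <- 1 + 2 * x + 0.5 * x^2    # parabola: beta0 = 1, beta1 = 2, beta2 = 0.5

fit_line  <- lm(y_line ~ x)
fit_parab <- lm(y_parab ~ x + I(x^2))   # I() keeps x^2 as an arithmetic term

coef(fit_line)    # intercept and slope
coef(fit_parab)   # intercept, linear and quadratic coefficients

# Residuals e_i = y_i - yhat_i (all essentially zero for noise-free data).
residuals_line <- resid(fit_line)
```

The quadratic model is still "linear" regression, because it is linear in the parameters even though it is not linear in x.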

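Suppose x_i and y_i are obtained for each unit in the sample; then a least squares estimate of β is given by:

b={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}

Thus the linear regression estimators of the population mean Ȳ and the population total Y are given by:

{\bar {y}}_{lr}={\bar {y}}+b({\bar {X}}-{\bar {x}}),\qquad {\hat {Y}}_{lr}=N{\bar {y}}_{lr}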

The regression estimator of Ȳ is biased because:

1. β is generally estimated by taking the ratio of the estimate of Cov (ȳ,x̄) to that of V(x̄), and

2. it involves the product of two estimates, namely b and x̄.

The bias of the regression estimator will usually be trivial and will decrease as the sample size increases.
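The least squares slope and the resulting regression estimators of the mean and total can be sketched in a few lines of R. The sample values, the known population mean Xbar and the population size N below are hypothetical; the slope is taken as the ordinary least squares slope from a fit with an intercept, b = s_xy / s_x².

```r
# Hypothetical sample and population quantities (invented for illustration).
x <- c(10, 12, 15, 20, 23)
y <- c(21, 25, 29, 41, 44)
Xbar <- 18    # assumed known population mean of x
N <- 50       # assumed population size

b <- cov(x, y) / var(x)                     # least squares slope, 217 / 118
ybar_lr <- mean(y) + b * (Xbar - mean(x))   # regression estimate of the mean
Y_lr <- N * ybar_lr                         # regression estimate of the total

b
ybar_lr
Y_lr
```

Because the sample mean of x (16) is below the known population mean (18) and the slope is positive, the estimator adjusts the sample mean of y upward.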

Apart from the least square method, some other methods to find the estimate include:

1. Bayesian Linear Regression

2. Least Absolute Deviations

3. Nonparametric Regression

4. Scenario Optimization

Applications of Regression Analysis include:

1. Prediction and forecasting, the use of which overlaps with that of machine learning

2. To infer "causal relationship" between dependent and independent variables.

COMPARING THE EFFICIENCY OF RATIO AND REGRESSION ESTIMATION:

Ratio and Regression estimators are compared on the basis of bias and coefficient of variation.
Theoretically, Ratio estimation is preferred over Regression estimation when the regression line of y on x passes close to the origin, since it is then about as precise and simpler to compute; when the line has a clearly non-zero intercept, Regression estimation is the better choice. Both estimators still work when there is only a weak linear relationship between x and y, although the gain in precision is then small.
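The theoretical comparison can be made concrete with the usual large-sample variance approximations for the two estimators of the mean: (1-f)/n × (s_y² + R² s_x² - 2Rρ s_x s_y) for the ratio estimator and (1-f)/n × s_y²(1-ρ²) for the regression estimator. Their difference simplifies to (1-f)/n × (R s_x - ρ s_y)², which is never negative and vanishes when R = ρ s_y / s_x, that is, when the fitted line passes through the origin. The helper name compare_variances and the summary values below are hypothetical.

```r
# Large-sample variances of the ratio and regression estimators of the mean.
compare_variances <- function(sy2, sx2, rho, R, n, N) {
  k <- (N - n) / (N * n)          # (1 - f) / n
  sx <- sqrt(sx2)
  sy <- sqrt(sy2)
  v_ratio <- k * (sy2 + R^2 * sx2 - 2 * R * rho * sx * sy)
  v_reg   <- k * sy2 * (1 - rho^2)
  c(ratio = v_ratio, regression = v_reg, difference = v_ratio - v_reg)
}

# Hypothetical summary statistics (invented for illustration).
away_from_origin <- compare_variances(sy2 = 100, sx2 = 25, rho = 0.8,
                                      R = 2, n = 20, N = 100)
through_origin   <- compare_variances(sy2 = 100, sx2 = 25, rho = 0.8,
                                      R = 1.6, n = 20, N = 100)
away_from_origin  # ratio 1.6, regression 1.44, difference 0.16
through_origin    # both 1.44, difference 0
```

In the second call R equals ρ s_y / s_x = 0.8 × 10 / 5, so the two variances coincide.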

To verify the results, we take an example of a practical problem using R programming and analyze the results.

In the example, Mr. John selects 21 states from a population of 50 states of a country and collects information on real estate farm loans and non-real estate farm loans.

We compare the results obtained using the Ratio estimator and the Regression estimator and suggest the more efficient method of estimation.

Here, x is the non-real estate farm loans (auxiliary variable) and y is the real estate farm loans (variable of interest).

The first step is to load the required packages and then import the data.

#Loading the required packages (install them beforehand with install.packages() if needed).
library(SDaA)
library(survey)

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

##
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
##
## dotchart

library(readxl)
dataset=read_excel("C:/Users/Kirti/Desktop/dataset.xlsx")
head(dataset)

## # A tibble: 6 x 2

##         x       y
##     <dbl>   <dbl>
## 1  348.    409.
## 2  431.     54.6
## 3  848.    908.
## 4 3929.   1343.
## 5  906.    316.
## 6    4.37    7.13

colnames(dataset)

## [1] "x" "y"

nrow(dataset)

## [1] 21

To check the relationship between x and y visually and to obtain the value of their correlation, we use the following code.

#Checking the correlation using scatter plot
plot(dataset$x, dataset$y)

#value of correlation
cor(dataset$x, dataset$y)

## [1] 0.664419

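Next, we fit a regression of y on the auxiliary variable x with lm() and take the fitted slope as the regression coefficient, b. The call below fits a line through the origin (y~0+x) with x itself as the weights. Note that the textbook regression estimator instead takes b from an ordinary least squares fit with an intercept, b = s_xy/s_x², which is also the slope assumed by the variance formula used further below.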

regx=dataset$x
reg_model=lm(y~0+x,weights=regx,data=dataset) ; reg_model

##
## Call:
## lm(formula = y ~ 0 + x, data = dataset, weights = regx)
##
## Coefficients:
## x
## 0.3982

summary(reg_model)

##
## Call:
## lm(formula = y ~ 0 + x, data = dataset, weights = regx)
##
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max
## -13848.0    -83.8    658.4   9843.9 28282.0
##
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x 0.39819    0.03368   11.82 1.76e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12220 on 20 degrees of freedom
## Multiple R-squared: 0.8749, Adjusted R-squared: 0.8686
## F-statistic: 139.8 on 1 and 20 DF, p-value: 1.764e-10

#estimated b is 0.3982
b=0.3982
#regression estimator is ybar_reg=ybar+b*(Xbar-xbar)
ybar=mean(dataset$y) ; ybar

## [1] 602.5512

xbar=mean(dataset$x) ; xbar

## [1] 899.3581

Xbar=878.16 ; Xbar #known

## [1] 878.16

ybar_reg=ybar+b*(Xbar-xbar)
ybar_reg #estimated population mean using regression estimator

## [1] 594.1101

Now, we calculate the variance of the regression estimator.

#Standard error: first find the estimate of the variance of ybar_reg, then take its square root. This needs the correlation coefficient (r), N, n and the sample mean square of y (sy2).

r=cor(dataset$y,dataset$x)
N=50
n=21
sy2=var(dataset$y) #var is the estimated value of S^2 for variable y
V=((N-n)/(N*n))*(sy2-(r^2)*sy2) ; V

## [1] 4645.556

SE=sqrt(V) ; SE

## [1] 68.15831

Now, we find the variance of the ratio estimator and compare the ratio and regression estimators to find the more efficient method.

#Estimation of the Variance by Ratio estimation
#Survey Sample Analysis and Ratio Estimation
re=svydesign(ids=~1,weights=~1,data=dataset)
#Ratio estimation
svyratio(numerator=~dataset$y,denominator=dataset$x,design=re)

## Ratio estimator: svyratio.survey.design2(numerator = ~dataset$y, denominator = dataset$x,
##     design = re)
## Ratios=
##
## dataset$y 0.6699792
## SEs=
##                [,1]
## dataset$y 0.1380009

1. The estimate of the Ratio is 0.6699792.

2. The Standard Error of the estimate of the Ratio is 0.1380009; that is, over repeated samples the estimated ratio would typically vary by about 0.138 around its expected value.
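A variant of the survey call that passes both variables as one-sided formulas and supplies the population size for the finite population correction might look like the sketch below. The toy data frame and its fpc column are hypothetical; svydesign() with ids and fpc arguments and svyratio() with formula arguments are existing functions of the survey package, and that part is skipped if the package is not installed. The base R lines compute the ratio and a standard error by hand, which the package output is expected to match up to its linearization conventions.

```r
# Hypothetical sample of n = 5 units from N = 50 (values invented).
toy <- data.frame(x = c(10, 12, 15, 20, 23),
                  y = c(21, 25, 29, 41, 44),
                  fpc = 50)   # population size, used as the fpc

# Base R: ratio and its standard error with the (N - n) / N correction.
n <- nrow(toy)
N <- 50
r_hat <- sum(toy$y) / sum(toy$x)
se_r <- sqrt(((N - n) / N) * sum((toy$y - r_hat * toy$x)^2) /
               (n - 1) / (n * mean(toy$x)^2))
c(ratio = r_hat, se = se_r)

# survey package version, run only when the package is available.
if (requireNamespace("survey", quietly = TRUE)) {
  des <- survey::svydesign(ids = ~1, fpc = ~fpc, data = toy)
  print(survey::svyratio(~y, ~x, design = des))
}
```

Leaving out the fpc argument treats the sample as if it came from an effectively infinite population, which gives a somewhat larger standard error when the sampling fraction is not small.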

#Estimating the Variance of the Estimate
sx2=var(dataset$x) #variance is the estimated value of S^2 for the variable X
sx=sqrt(sx2)
sy=sqrt(sy2)
r=0.664419
R=0.6699792
V1=(((N-n)/(n*N))*(sy2+(R^2)*sx2-(2*R*r*sx*sy)))
V1

## [1] 8934.233

INTERPRETATION: The variance of the estimate of Ybar by Ratio estimation (V1=8934.233) is clearly greater than the variance of the estimate of Ybar by Regression estimation (V=4645.556).

The estimator with the smaller variance is the more efficient one; hence, for this data set, the Regression method of estimation is more efficient than the Ratio method of estimation.
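The size of the gain can be summarized as a relative efficiency, the ratio of the two variances. The sketch below simply reuses the two variances reported in this post; the variable names are chosen here for illustration.

```r
# Variances reported in this post for the two estimators of the mean.
V_reg   <- 4645.556   # regression estimator
V_ratio <- 8934.233   # ratio estimator

# Relative efficiency of regression versus ratio estimation.
rel_eff <- V_ratio / V_reg
rel_eff        # about 1.92

# Corresponding standard errors, for a comparison on the scale of the data.
se_reg   <- sqrt(V_reg)     # about 68.16
se_ratio <- sqrt(V_ratio)   # about 94.52
```

A relative efficiency of about 1.92 means that, under these formulas, the ratio estimator would need roughly twice the sample size to match the precision of the regression estimator.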

CONCLUSION:

In general, we can conclude that if a linear relationship between the x and y variates exists and the regression of y on x passes through the origin, then the estimated variance of the regression estimator is often less than that of the ratio estimator. The precise relationship between the variances depends on the linearity of the relation between the x and y variates. When the relation is other than linear, the ratio estimate could have a lower variance than that estimated by regression.
