SAMPLING FOR PROPORTIONS UNDER SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT


Name: JOANNA JONES 

Registration No.: 2148134

DEPARTMENT OF STATISTICS

CHRIST (DEEMED TO BE UNIVERSITY), BENGALURU 





In several instances, the characteristic under study on which the observations are gathered is qualitative or categorical in nature. For example, the items coming off an assembly line are defective or not; or in marketing surveys, the customers are asked to arrange their preferences in order, like first preference, second preference, etc. In these scenarios, the question that arises is how does one carry out the sampling and how can the population parameters like population mean, population total, etc. be estimated.

Thus, in this blog, we shall go about answering these questions and also solve an example using R. 


Procedure for Sampling

The procedure for drawing a sample does not depend on whether the characteristic under study is qualitative or quantitative. For instance, samples are drawn by Simple Random Sampling Without Replacement (SRSWOR) or Simple Random Sampling With Replacement (SRSWR) in exactly the same way in both cases. The same holds for other sampling methods such as stratified sampling, systematic sampling, etc.


Estimation of Population Proportion:

In the population, the units are classified into two categories:

  1.  Possessing a particular characteristic 
  2.  Not possessing that characteristic 

We shall consider the example of items coming off an assembly line being defective or not defective. If the item is defective, we say that it possesses the characteristic, 'defective' and will be denoted as C. If it is not defective, we say that it does not possess the particular characteristic, 'defective', and will be denoted as C*. Thus, the population is divided into two mutually exclusive classes, C and C*.

Let A be the number of units in C and (N - A) be the number of units in C* in a population of size N. Then, the proportion of units in C is

$$P = \frac{A}{N}$$

and the proportion of units in C* is

$$Q = \frac{N - A}{N} = 1 - P.$$

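We can associate the characteristic under study with an indicator variable Yi, for i = 1, 2, ..., N, which takes the value 1 if the ith unit belongs to C and 0 if the ith unit belongs to C*. In other words,

$$Y_i = \begin{cases} 1, & \text{if the } i\text{th unit belongs to } C \\ 0, & \text{if the } i\text{th unit belongs to } C^* \end{cases}$$

The population total is

$$\sum_{i=1}^{N} Y_i = A$$

and the population mean is

$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{A}{N} = P.$$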

Hence, the question of estimating the population proportion becomes that of estimating the population mean by defining the variable Yi as above.
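
As a small illustration of this coding, the following base R sketch builds the indicator variable for a made-up population of 10 items and confirms that its mean is the proportion of defectives. The item labels "D" and "G" and the population itself are assumptions chosen only for the example.

```r
# Hypothetical population of 10 items: "D" = defective (class C), "G" = good (class C*)
items <- c("D", "G", "G", "D", "G", "G", "G", "D", "G", "G")

# Indicator variable: Y_i = 1 if the ith unit is in C, 0 otherwise
Y <- as.integer(items == "D")

N <- length(Y)   # population size
A <- sum(Y)      # population total = number of units in C
P <- A / N       # proportion of units in C
Q <- 1 - P       # proportion of units in C*

# The mean of the indicator variable equals the proportion P
isTRUE(all.equal(mean(Y), P))
```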






Estimation of Sample Proportion:

Now, suppose a sample of size n is drawn from a population of size N by simple random sampling (SRS).

Let a be the number of units in the sample which fall into class C, so that (n - a) units fall into class C*. Then the sample proportion of units in C is

$$p = \frac{a}{n},$$

which can also be written as

$$p = \frac{a}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}.$$


Estimation of Variance:

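Since

$$\sum_{i=1}^{N} Y_i^2 = A = NP,$$

we can write S² and s² in terms of P and Q as follows:

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right) = \frac{1}{N-1}\left(NP - NP^2\right) = \frac{N}{N-1}PQ.$$

Similarly, as

$$\sum_{i=1}^{n} y_i^2 = a = np$$

and

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right) = \frac{1}{n-1}\left(np - np^2\right) = \frac{n}{n-1}pq,$$

where q = 1 - p.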

NOTE: 

The sample quantities y̅ and s² are expressed as functions of the sample proportion p, while the population quantities Y̅ and S² are expressed as functions of the population proportion P.

Since the sample is drawn by simple random sampling and the sample proportion is the same as the sample mean, the properties of the sample proportion under SRSWOR and SRSWR follow directly from the properties of the sample mean.
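
The identities between the sample mean, the sample variance and the sample proportion can be checked numerically. The sketch below is a minimal base R illustration on a made-up population (N = 50 with 20 units in C, both values assumed for the example); it relies on the fact that sample() draws without replacement by default and that var() uses the divisor n - 1.

```r
set.seed(1)  # arbitrary seed, used only so the draw is reproducible

# Hypothetical population of N = 50 indicator values, 20 of them equal to 1
Y <- rep(c(1, 0), times = c(20, 30))
N <- length(Y)
n <- 10

# sample() draws without replacement by default, i.e. an SRSWOR sample
y <- sample(Y, size = n)

a <- sum(y)   # number of sampled units in class C
p <- a / n    # sample proportion
q <- 1 - p

# The sample mean equals the sample proportion
isTRUE(all.equal(mean(y), p))

# s^2 as computed by var() equals n / (n - 1) * p * q
isTRUE(all.equal(var(y), n / (n - 1) * p * q))
```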


SRSWOR

Mean:

In the case of SRSWOR, the sample mean y̅ is an unbiased estimator of the population mean Y̅, i.e.

$$E(\bar{y}) = \bar{Y},$$

hence

$$E(p) = E(\bar{y}) = \bar{Y} = P,$$

and thus, p is an unbiased estimator of P.

Variance:

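Now, we shall use the expression for Var(y̅) to derive the variance of p:

$$\operatorname{Var}(p) = \operatorname{Var}(\bar{y}) = \frac{N-n}{Nn}S^2 = \frac{N-n}{Nn}\cdot\frac{N}{N-1}PQ = \frac{N-n}{N-1}\cdot\frac{PQ}{n}.$$

Similarly, we shall use the estimate of Var(y̅) to derive the estimate of the variance of p:

$$\widehat{\operatorname{Var}}(p) = \frac{N-n}{Nn}s^2 = \frac{N-n}{Nn}\cdot\frac{n}{n-1}pq = \frac{N-n}{N(n-1)}pq.$$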
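
For a very small population, the SRSWOR results for p can be verified exactly by listing every possible sample. The base R sketch below does this for a made-up population of N = 6 units with A = 2 in class C and samples of size n = 3 (all three values are assumptions for the illustration). It checks that E(p) = P, that Var(p) = (N - n)/(N - 1) · PQ/n, and that the estimator (N - n)/(N(n - 1)) · pq is unbiased for Var(p).

```r
# Hypothetical population: N = 6 units, A = 2 of them in class C
Y <- c(1, 1, 0, 0, 0, 0)
N <- length(Y)
n <- 3
P <- mean(Y)
Q <- 1 - P

# Every possible SRSWOR sample of size n, one per column
samples <- combn(N, n)

# Sample proportion and estimated variance of p for each possible sample
p_all     <- apply(samples, 2, function(idx) mean(Y[idx]))
v_hat_all <- (N - n) / (N * (n - 1)) * p_all * (1 - p_all)

# Under SRSWOR every sample is equally likely, so expectations are plain averages
E_p   <- mean(p_all)
Var_p <- mean((p_all - P)^2)

isTRUE(all.equal(E_p, P))                                 # p is unbiased for P
isTRUE(all.equal(Var_p, (N - n) / (N - 1) * P * Q / n))   # variance formula
isTRUE(all.equal(mean(v_hat_all), Var_p))                 # variance estimate is unbiased
```

Exhaustive enumeration is only feasible for tiny populations, but it makes the expectations exact rather than simulated.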



Estimation of Population Total or Total Number of Count:

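The estimate of the population total A (the total count of units in C) is

$$\hat{A} = Np = \frac{Na}{n},$$

its variance is

$$\operatorname{Var}(\hat{A}) = N^2\operatorname{Var}(p) = \frac{N^2(N-n)}{N-1}\cdot\frac{PQ}{n},$$

and the estimate of its variance is

$$\widehat{\operatorname{Var}}(\hat{A}) = N^2\,\widehat{\operatorname{Var}}(p) = \frac{N(N-n)}{n-1}pq.$$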



Confidence Interval Estimation of P:

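If N and n are large, then

$$\frac{p - P}{\sqrt{\operatorname{Var}(p)}}$$

approximately follows N(0,1). With this approximation, we can write

$$P\left[-Z_{\alpha/2} \le \frac{p - P}{\sqrt{\operatorname{Var}(p)}} \le Z_{\alpha/2}\right] = 1 - \alpha,$$

and the 100(1 - α)% confidence interval of P is

$$\left[\,p - Z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(p)},\;\; p + Z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(p)}\,\right],$$

where Var(p) has been replaced by its estimate (N - n)pq / (N(n - 1)) because P is unknown.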
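
One way to see what the normal approximation delivers is to simulate repeated SRSWOR samples and count how often the interval p ± Z · sqrt((N - n)pq / (N(n - 1))) contains the true P. The base R sketch below is a rough illustration only: the population (N = 1000, A = 400), the sample size n = 100, the number of replications and the seed are all assumed values, not taken from the analysis in this blog.

```r
set.seed(2024)  # arbitrary seed, used only for reproducibility

# Hypothetical population: N = 1000 units, A = 400 in class C
N <- 1000
A <- 400
n <- 100
Y <- rep(c(1, 0), times = c(A, N - A))
P <- A / N
z <- qnorm(0.975)   # Z value for a 95% interval

# Draw 2000 SRSWOR samples and record whether each interval covers P
covers <- replicate(2000, {
  p  <- mean(sample(Y, n))                              # sample proportion
  se <- sqrt((N - n) / (N * (n - 1)) * p * (1 - p))     # estimated standard error
  (p - z * se <= P) && (P <= p + z * se)
})

# Observed share of intervals that contain the true proportion
coverage <- mean(covers)
coverage
```

If the approximation is adequate, the observed coverage should be in the neighbourhood of the nominal 95%.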

 



Applications of Sampling for Proportions:

  1.  Television ratings are determined by estimating the proportion of the public who watch a certain program.
  2.  Responses of customers in marketing surveys that are based on replies such as ‘yes’ or ‘no’, ‘agree’ or ‘disagree’, etc.
  3.  Estimating the population proportion or percentage of brown-eyed people, unemployed people, those who have graduated, people favoring a proposal, etc.
  4.  Proportion of all items coming off an assembly line that are defective.
  5.  Proportion of all people entering a retail store who purchase before leaving.

Thus, there is a wide range of applications pertaining to sampling for proportions.


Explanation of Sampling for Proportions Using R

The data set chosen is the Prostate Cancer data. Prostate cancer begins when cells in the prostate gland start to grow out of control. The prostate gland, which produces seminal fluid, is found only in males. The data set has been taken from Kaggle and is available at https://www.kaggle.com/sajidsaifi/prostate-cancer.



We see that there are 100 observations of 10 variables. Since we need to estimate proportions, we shall specifically analyze the variable diagnosis_result, which records whether the diagnosis is malignant (denoted by 1), meaning the cancer can grow and spread to other parts of the body, or benign (denoted by 0), meaning the growth does not spread.




Summary of the Report

The Prostate Cancer dataset was used to carry out the analysis. Prostate cancer begins when cells in the prostate gland start to grow out of control.
  • The population size was 100 and the sample size was 20. Using the Sprop() function, the estimated population proportion of patients diagnosed as malignant was 0.65 (65%). The standard error was 0.0979, which measures how much the sample proportion of malignant diagnoses is expected to vary across different samples of size 20 drawn from the population of size 100.
  • The estimate of the population total is 65 patients diagnosed as malignant.
The confidence intervals are calculated by assuming that the sample proportion is approximately normally distributed.
  • The 95% confidence interval for the population proportion of patients diagnosed as malignant was [0.4582, 0.8418]. That is, we are 95% confident that the true proportion of patients diagnosed as malignant in the population lies between 0.4582 and 0.8418.
  • The 95% confidence interval for the population total of patients diagnosed as malignant was [46, 84]. That is, we are 95% confident that the true number of patients diagnosed as malignant in the population lies between 46 and 84.
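
The figures in this summary can be cross-checked by hand with the SRSWOR formulas, without calling Sprop(). The base R sketch below assumes a = 13 malignant cases in the sample, which is what a sample proportion of 0.65 with n = 20 implies; it does not read the data set itself, and rounding the limits for the total to whole patients is an assumption about how the reported interval was obtained.

```r
N <- 100   # population size
n <- 20    # sample size
a <- 13    # assumed number of malignant cases in the sample (0.65 * 20)

p <- a / n   # estimated proportion
q <- 1 - p

# Estimated standard error of p under SRSWOR
se <- sqrt((N - n) / (N * (n - 1)) * p * q)

# 95% confidence interval for P using the normal approximation
z    <- qnorm(0.975)
ci_p <- c(p - z * se, p + z * se)

# Estimated total and its interval, obtained by scaling by N
A_hat <- N * p
ci_A  <- N * ci_p

round(se, 4)     # 0.0979
round(ci_p, 4)   # 0.4582 0.8418
A_hat            # 65
round(ci_A)      # 46 84
```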

