SAMPLING FOR PROPORTIONS UNDER SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT
Name: JOANNA JONES
Registration No.: 2148134
DEPARTMENT OF STATISTICS
CHRIST (DEEMED TO BE UNIVERSITY), BENGALURU
In many situations, the characteristic under study is qualitative or categorical in nature. For example, items coming off an assembly line are either defective or not; in marketing surveys, customers are asked to rank their preferences as first preference, second preference, and so on. In such scenarios, two questions arise: how should the sampling be carried out, and how can population parameters such as the population mean and the population total be estimated?
Thus, in this blog, we shall go about answering these questions and also solve an example using R.
Procedure for Sampling
The procedure for drawing a sample does not depend on whether the characteristic under study is qualitative or quantitative. For instance, Simple Random Sampling Without Replacement (SRSWOR) and Simple Random Sampling With Replacement (SRSWR) are carried out in exactly the same way in both cases. The same holds for other sampling methods such as stratified sampling and systematic sampling.
Estimation of Population Proportion:
In the population, the units are classified into two categories:
- Possessing a particular characteristic
- Not possessing that characteristic
We shall consider the example of items coming off an assembly line being defective or not defective. If an item is defective, we say that it possesses the characteristic 'defective', and it belongs to the class denoted by C. If it is not defective, it does not possess the characteristic, and it belongs to the class denoted by C*. Thus, the population is divided into two mutually exclusive classes, C and C*.
Let A be the number of units in C and (N - A) the number of units in C*, in a population of size N. Then the proportion of units in C is
$$P = \frac{A}{N}$$
and the proportion of units in C* is
$$Q = \frac{N - A}{N} = 1 - P.$$
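We can associate the characteristic under study with an indicator variable Yi, i = 1, 2, ..., N, which takes the value 1 if the ith unit belongs to C and 0 if the ith unit belongs to C*. In other words,
$$Y_i = \begin{cases} 1, & \text{if the } i\text{th unit belongs to } C \\ 0, & \text{if the } i\text{th unit belongs to } C^* \end{cases}$$
The population total is
$$\sum_{i=1}^{N} Y_i = A$$
and the population mean is
$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{A}{N} = P.$$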
Hence, the question of estimating the population proportion becomes that of estimating the population mean by defining the variable Yi as above.
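The link between the indicator variable and the proportion can be checked with a few lines of R. This is only a minimal sketch: the vector `Y` is a made-up population of 10 items for illustration and is not taken from any data set used in this post.

```r
# Hypothetical population of N = 10 items (illustrative values only):
# 1 = defective (class C), 0 = not defective (class C*)
Y <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

N <- length(Y)   # population size
A <- sum(Y)      # population total = number of units in C
P <- A / N       # proportion of units in C
Q <- 1 - P       # proportion of units in C*

# The mean of the indicator variable is the population proportion
stopifnot(isTRUE(all.equal(mean(Y), P)))

cat("A =", A, " P =", P, " Q =", Q, "\n")
```

Because the mean of a 0/1 variable is the proportion of ones, any result stated for the sample mean carries over to the sample proportion.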
Estimation of Sample Proportion:
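Now, suppose a sample of size n is drawn from a population of size N by simple random sampling (SRS).
Let a be the number of units in the sample that fall into class C, so that (n - a) units fall into class C*. Then the sample proportion of units in C is
$$p = \frac{a}{n}$$
which can also be written as
$$p = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y},$$
where yi is the value of the indicator variable for the ith sampled unit.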
Estimation of Variance:
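Since the indicator variable takes only the values 0 and 1, its square equals itself, so
$$\sum_{i=1}^{N} Y_i^2 = A = NP, \qquad \sum_{i=1}^{n} y_i^2 = a = np.$$
Thus, we can write S2 in terms of P and Q, and s2 in terms of p and q = 1 - p, as follows:
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{N-1}\left(NP - NP^2\right) = \frac{N}{N-1}PQ$$
and
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left(np - np^2\right) = \frac{n}{n-1}pq.$$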
NOTE:
The quantities y̅ and s2 are functions of the sample proportion, while Y̅ and S2 are functions of the population proportion.
Since the sample is drawn by simple random sampling and the sample proportion is the same as the sample mean, the properties of the sample proportion under SRSWOR and SRSWR follow directly from the properties of the sample mean.
SRSWOR
Mean:
The sample mean y̅ is an unbiased estimator of the population mean Y̅ in the case of SRSWOR, i.e.
$$E(\bar{y}) = \bar{Y},$$
hence
$$E(p) = E(\bar{y}) = \bar{Y} = P,$$
and thus p is an unbiased estimator of P.
Variance:
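Now, we shall use the expression for Var(y̅) to derive the variance of p:
$$\mathrm{Var}(p) = \mathrm{Var}(\bar{y}) = \frac{N-n}{Nn}S^2 = \frac{N-n}{Nn}\cdot\frac{N}{N-1}PQ = \frac{N-n}{N-1}\cdot\frac{PQ}{n}.$$
Similarly, we shall use the estimate of Var(y̅) to derive the estimate of the variance of p:
$$\widehat{\mathrm{Var}}(p) = \frac{N-n}{Nn}s^2 = \frac{N-n}{Nn}\cdot\frac{n}{n-1}pq = \frac{N-n}{N(n-1)}pq.$$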
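For a very small population, the SRSWOR results for p can be verified by listing every possible sample. The sketch below uses a made-up population of N = 6 units (an assumption for illustration only) and base R's `combn()` to enumerate all samples of size n = 3. It compares the average of p with P, the spread of p with (N - n)/(N - 1) · PQ/n, and the average of the variance estimator (N - n)/(N(n - 1)) · pq with that same variance.

```r
# Hypothetical population of N = 6 units, 3 of which belong to class C
Y <- c(1, 1, 0, 1, 0, 0)
N <- length(Y)
n <- 3
P <- mean(Y)
Q <- 1 - P

# Sample proportion p for each of the choose(N, n) equally likely SRSWOR samples
p_all <- combn(Y, n, FUN = mean)

# Unbiasedness: the average of p over all possible samples equals P
mean_p <- mean(p_all)

# Variance of p over all possible samples versus the formula
var_p_enumerated <- mean((p_all - P)^2)
var_p_formula <- (N - n) / (N - 1) * P * Q / n

# Average of the variance estimator over all possible samples
var_hat_all <- (N - n) / (N * (n - 1)) * p_all * (1 - p_all)
mean_var_hat <- mean(var_hat_all)

cat("Number of samples:", length(p_all), "\n")
cat("E(p) =", mean_p, " P =", P, "\n")
cat("Var(p), enumerated =", var_p_enumerated, " formula =", var_p_formula, "\n")
cat("Average of estimated Var(p) =", mean_var_hat, "\n")
```

Full enumeration is only practical for tiny populations, since the number of samples grows very quickly with N and n, but it makes the meaning of "unbiased" concrete without relying on random simulation.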
Estimation of Population Total or Total Number of Count:
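The estimate of the population total A (or total count) is
$$\hat{A} = Np = \frac{Na}{n},$$
its variance is
$$\mathrm{Var}(\hat{A}) = N^2\,\mathrm{Var}(p) = \frac{N^2(N-n)}{N-1}\cdot\frac{PQ}{n},$$
and the estimate of this variance is
$$\widehat{\mathrm{Var}}(\hat{A}) = N^2\,\widehat{\mathrm{Var}}(p) = \frac{N(N-n)}{n-1}pq.$$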
Confidence Interval Estimation of P:
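If N and n are large, then
$$\frac{p - P}{\sqrt{\widehat{\mathrm{Var}}(p)}}$$
approximately follows N(0,1). With this approximation, we can write
$$P\left[-Z_{\alpha/2} \le \frac{p - P}{\sqrt{\widehat{\mathrm{Var}}(p)}} \le Z_{\alpha/2}\right] = 1 - \alpha,$$
so the 100(1 - α)% confidence interval for P is
$$\left[\,p - Z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(p)},\;\; p + Z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(p)}\,\right].$$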
Applications of Sampling for Proportions:
- Television ratings are determined by estimating the proportion of the public who watch a certain program.
- Responses of customers in marketing surveys that are based on replies such as ‘yes’ or ‘no’, ‘agree’ or ‘disagree’, etc.
- Estimating the population proportion or the percentage of brown-eyed people, unemployed people, those who have graduated or people favoring a proposal, etc.
- Proportion of all items coming off an assembly line that are defective.
- Proportion of all people entering a retail store who purchase before leaving.
Explanation of Sampling for Proportions Using R
The data set chosen is the Prostate Cancer data. Prostate cancer begins when cells in the prostate gland start to grow out of control. The prostate gland, which produces seminal fluid, is found only in males. The data set has been taken from Kaggle and is available at https://www.kaggle.com/sajidsaifi/prostate-cancer.
We see that there are 100 observations of 10 variables. Since we need to estimate proportions, we shall specifically analyze the variable diagnosis_result, which records whether the diagnosis is malignant, meaning the tumour can grow and spread to other parts of the body (denoted by 1), or benign, meaning the tumour can grow but does not spread (denoted by 0).
Summary of the Report
- The population size was 100 and the sample size was 20. Using the Sprop() function, the estimated proportion of patients diagnosed as malignant in the population was 0.65 (65%). The standard error was 0.0979, which measures how much the estimated proportion is expected to vary across different samples of size 20 drawn from the population of size 100.
- The estimate of the population total is 65 patients diagnosed as malignant.
- The 95% confidence interval for the population proportion of patients diagnosed as malignant was [0.4582, 0.8418]. That is, we are 95% confident that the proportion of patients in the population diagnosed as malignant lies between 0.4582 and 0.8418.
- The 95% confidence interval for the population total of patients diagnosed as malignant was [46, 84]. That is, we are 95% confident that the number of patients in the population diagnosed as malignant lies between 46 and 84.
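The figures in this summary can be reproduced with base R directly from the SRSWOR formulas, without calling Sprop(). The sketch below is not the original analysis code: it assumes that a = 13 of the 20 sampled patients were malignant (inferred from p = 0.65 and n = 20) and uses the normal approximation for the confidence interval.

```r
# Values reported in the summary
N <- 100   # population size
n <- 20    # sample size
a <- 13    # malignant cases in the sample (assumption: inferred from p = 0.65)

p <- a / n
q <- 1 - p

# Estimated standard error of p under SRSWOR
se_p <- sqrt((N - n) / (N * (n - 1)) * p * q)

# 95% confidence interval for P using the normal approximation
z <- qnorm(0.975)
ci_p <- p + c(-1, 1) * z * se_p

# Estimated population total and its confidence interval
A_hat <- N * p
ci_A <- N * ci_p

cat("p =", p, " SE =", round(se_p, 4), "\n")
cat("95% CI for P: [", round(ci_p, 4), "]\n")
cat("Estimated total =", A_hat, " 95% CI: [", round(ci_A), "]\n")
```

The interval for the total is simply N times the interval for the proportion, which is why its limits round to 46 and 84.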














