n or n-1 for sample standard deviation?

2822309062 · Oct 1, 2020

So I was doing some 2020 school trial papers and I found that none have included sqrt(p(1-p)/(n-1)) as the answer for sample standard deviation. Just wondering what do you guys think, would n-1 be safer in HSC or n will be fine?

Trebla · Oct 1, 2020

I would be interested to see what the questions actually ask in those papers because there is a nuance here that I don't think is easily understood.

If X is a B(n,p) random variable counting the successes in n independent Bernoulli trials, each with probability of success p, then if we define

$\hat{p}=\dfrac{X}{n}$

this means that

$\text{var}(\hat{p})=\dfrac{p(1-p)}{n}$

Notice that p is actually a parameter describing the 'population' in relation to each Bernoulli trial. In other words, if p is found directly from the problem (as the probability of success) rather than being estimated from relative frequencies in a sample, then you’re not working with a “sample” variance per se.
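
For reference, a short sketch of where that variance comes from, assuming only the standard binomial result $\text{var}(X)=np(1-p)$ and the scaling rule $\text{var}(aX)=a^2\,\text{var}(X)$:

$\text{var}(\hat{p})=\text{var}\left(\dfrac{X}{n}\right)=\dfrac{1}{n^2}\text{var}(X)=\dfrac{np(1-p)}{n^2}=\dfrac{p(1-p)}{n}$

The divisor n here comes purely from that scaling. Nothing is estimated from data, which is why no n-1 appears.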

2822309062 · Oct 1, 2020

That's interesting, I thought "sample variance" was a general term for whenever we are dealing with samples.
So, for example, suppose a factory that only produces pens makes 1000 pens (the population) and 100 are chosen as a sample, and the question asks for the standard deviation of the sampling distribution.

If an overall fault rate of 10% is given for the whole population of 1000 pens, is it appropriate to use 10% as the probability to calculate the "sample" standard deviation, giving $\sqrt{\frac{0.1(1-0.1)}{100}}$? Or is $\sqrt{\frac{0.1(1-0.1)}{100}}$ the population standard deviation, since you said "then you’re not working with a “sample” variance per se"? If it is the population standard deviation, wouldn't it differ from $\sqrt{\frac{0.1(1-0.1)}{1000}}$, which should also represent the population standard deviation?

Moreover, if the overall fault rate of 10% is disregarded and independent testing of the 100 sampled pens finds a fault rate of 5%, would the "sample" standard deviation be $\sqrt{\frac{0.05(1-0.05)}{100-1}}$?

I am a bit confused now and might have misinterpreted your message. I would really appreciate it if you could reply.
Many thanks

Trebla · Oct 2, 2020

The value of the sample variance is the spread of the data within the sample. It will always vary depending on what sample you get. It is a realisation of an experiment so it shouldn't really have any probabilities in it. In your example, say you take a sample of 100 pens out of 1000 pens and see how many are faulty. It doesn't make sense that in every sample of 100 pens you will get exactly 10 pens that are faulty. You may get 9 or 12 pens etc as it depends on your sample.
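
To illustrate that point, here is a minimal simulation sketch. It assumes exactly 100 of the 1000 pens are faulty and that each sample of 100 is drawn without replacement; the function name `faulty_counts` and its parameters are made up for this example.

```python
import random

def faulty_counts(population_size=1000, faulty=100, sample_size=100,
                  draws=5, seed=0):
    """Count faulty pens in repeated random samples drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # 1 marks a faulty pen, 0 marks a good one (assumed 10% faulty overall)
    population = [1] * faulty + [0] * (population_size - faulty)
    return [sum(rng.sample(population, sample_size)) for _ in range(draws)]

counts = faulty_counts()
print(counts)  # one faulty-pen count per sample of 100 pens
```

Each entry is the number of faulty pens in one sample, so dividing by 100 gives that sample's proportion. The counts will generally differ from sample to sample rather than all being exactly 10.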

Perhaps worth looking at this from first principles. Let $X_1, X_2,...,X_n$ each be Bernoulli trials. Denote $x_1, x_2,...,x_n$ as the realisation (or data points) of each of these random variables. These are just a bunch of 0s and 1s where 0 represents failure and 1 represents success.

Suppose that out of the n trials, we get m successes ($0 \le m \le n$), which is our particular sample. This means that our sample data has m 1s and n-m 0s, so

$x_1 + x_2 + ... + x_n = m$

hence the sample mean is simply

$\bar{x} = \dfrac{m}{n}$

The sample variance is given by

$s^2=\dfrac{1}{n-1}\sum_{k=1}^n\left(x_k - \dfrac{m}{n}\right)^2$

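After some algebra (each $x_k$ is 0 or 1, so $x_k^2=x_k$ and $\sum x_k^2 = m$) you get

$s^2=\dfrac{m(n-m)}{n(n-1)}=\dfrac{n\,\bar{x}(1-\bar{x})}{n-1}$

This is the spread of the individual 0/1 data points. Dividing by n gives the corresponding estimate of the variance of the sample proportion, which means that

$\dfrac{s^2}{n}=\dfrac{m(n-m)}{n^2(n-1)}=\dfrac{\bar{x}(1-\bar{x})}{n-1}$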

Notice that your sample mean is in fact your sample proportion. At no point did we use the probability of success in the context of the sample but rather the relative frequency in the sample.

This is different to the variance of the random variable which uses probabilities and distribution functions to measure its spread. This is what most questions typically ask about.
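
As a numerical check on the algebra above, here is a small sketch using Python's standard `statistics.variance`, which divides by n-1. The helper name `bernoulli_sample_variance` and the values m=5, n=100 (the 5% example from earlier in the thread) are chosen purely for illustration.

```python
import math
import statistics

def bernoulli_sample_variance(m, n):
    """Return (s2, closed_form, per_proportion) for a 0/1 sample with m ones."""
    data = [1] * m + [0] * (n - m)             # m successes, n - m failures
    s2 = statistics.variance(data)             # sample variance, divisor n - 1
    closed_form = m * (n - m) / (n * (n - 1))  # same quantity in closed form
    x_bar = m / n                              # sample mean = sample proportion
    per_proportion = x_bar * (1 - x_bar) / (n - 1)
    return s2, closed_form, per_proportion

s2, closed_form, per_proportion = bernoulli_sample_variance(m=5, n=100)
print(math.isclose(s2, closed_form))           # variance of the raw 0/1 data
print(math.isclose(s2 / 100, per_proportion))  # divided by n: proportion scale
print(math.sqrt(per_proportion))               # roughly 0.0219
```

Note that $\dfrac{\bar{x}(1-\bar{x})}{n-1}$ matches the sample variance of the raw 0/1 data divided by n, i.e. it is on the scale of the sample proportion rather than of the individual data points.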

2822309062 · Oct 2, 2020

That makes sense, thanks a lot
