Statistical Description of Data part 6


Chapter 14. Statistical Description of Data

Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43108-5). Copyright (C) 1988-1992 by Cambridge University Press. Programs Copyright (C) 1988-1992 by Numerical Recipes Software. Permission is granted for internet users to make one paper copy for their own personal use. Further reproduction, or any copying of machine-readable files (including this one) to any server computer, is strictly prohibited. To order Numerical Recipes books, diskettes, or CDROMs visit website http://www.nr.com or call 1-800-872-7423 (North America only), or send email to trade@cup.cam.ac.uk (outside North America).

14.5 Linear Correlation

We next turn to measures of association between variables that are ordinal or continuous, rather than nominal. Most widely used is the linear correlation coefficient. For pairs of quantities $(x_i, y_i)$, $i = 1, \ldots, N$, the linear correlation coefficient $r$ (also called the product-moment correlation coefficient, or Pearson's $r$) is given by the formula

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \tag{14.5.1} $$

where, as usual, $\bar{x}$ is the mean of the $x_i$'s, $\bar{y}$ is the mean of the $y_i$'s.

The value of $r$ lies between −1 and 1, inclusive. It takes on a value of 1, termed "complete positive correlation," when the data points lie on a perfect straight line with positive slope, with $x$ and $y$ increasing together. The value 1 holds independent of the magnitude of the slope. If the data points lie on a perfect straight line with negative slope, $y$ decreasing as $x$ increases, then $r$ has the value −1; this is called "complete negative correlation." A value of $r$ near zero indicates that the variables $x$ and $y$ are uncorrelated.
When a correlation is known to be significant, $r$ is one conventional way of summarizing its strength. In fact, the value of $r$ can be translated into a statement about what residuals (root mean square deviations) are to be expected if the data are fitted to a straight line by the least-squares method (see §15.2, especially equations 15.2.13 – 15.2.14). Unfortunately, $r$ is a rather poor statistic for deciding whether an observed correlation is statistically significant, and/or whether one observed correlation is significantly stronger than another. The reason is that $r$ is ignorant of the individual distributions of $x$ and $y$, so there is no universal way to compute its distribution in the case of the null hypothesis.

About the only general statement that can be made is this: If the null hypothesis is that $x$ and $y$ are uncorrelated, and if the distributions for $x$ and $y$ each have enough convergent moments ("tails" die off sufficiently rapidly), and if $N$ is large (typically > 500), then $r$ is distributed approximately normally, with a mean of zero and a standard deviation of $1/\sqrt{N}$. In that case, the (double-sided) significance of the correlation, that is, the probability that $|r|$ should be larger than its observed value in the null hypothesis, is

$$ \operatorname{erfc}\left(\frac{|r|\sqrt{N}}{\sqrt{2}}\right) \tag{14.5.2} $$

where erfc(x) is the complementary error function, equation (6.2.8), computed by the routines erffc or erfcc of §6.2. A small value of (14.5.2) indicates that the two distributions are significantly correlated. (See expression 14.5.9 below for a more accurate test.)

Most statistics books try to go beyond (14.5.2) and give additional statistical tests that can be made using $r$. In almost all cases, however, these tests are valid only for a very special class of hypotheses, namely that the distributions of $x$ and $y$ jointly form a binormal or two-dimensional Gaussian distribution around their mean values, with joint probability density

$$ p(x, y)\,dx\,dy = \text{const.} \times \exp\left[-\frac{1}{2}\left(a_{11}x^2 - 2a_{12}xy + a_{22}y^2\right)\right] dx\,dy \tag{14.5.3} $$

where $a_{11}$, $a_{12}$, and $a_{22}$ are arbitrary constants. For this distribution $r$ has the value

$$ r = \frac{a_{12}}{\sqrt{a_{11}a_{22}}} \tag{14.5.4} $$

(Because the cross term enters the exponent of (14.5.3) with a minus sign, a positive $a_{12}$ corresponds to a positive correlation.)

There are occasions when (14.5.3) may be known to be a good model of the data. There may be other occasions when we are willing to take (14.5.3) as at least a rough and ready guess, since many two-dimensional distributions do resemble a binormal distribution, at least not too far out on their tails. In either situation, we can use (14.5.3) to go beyond (14.5.2) in any of several directions:

First, we can allow for the possibility that the number $N$ of data points is not large.
Here, it turns out that the statistic

$$ t = r\sqrt{\frac{N-2}{1-r^2}} \tag{14.5.5} $$

is distributed in the null case (of no correlation) like Student's t-distribution with $\nu = N - 2$ degrees of freedom, whose two-sided significance level is given by $1 - A(t|\nu)$ (equation 6.4.7). As $N$ becomes large, this significance and (14.5.2) become asymptotically the same, so that one never does worse by using (14.5.5), even if the binormal assumption is not well substantiated.

Second, when $N$ is only moderately large (≥ 10), we can compare whether the difference of two significantly nonzero $r$'s, e.g., from different experiments, is itself significant. In other words, we can quantify whether a change in some control variable significantly alters an existing correlation between two other variables. This is done by using Fisher's z-transformation to associate each measured $r$ with a corresponding $z$,

$$ z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \tag{14.5.6} $$

Then, each $z$ is approximately normally distributed with a mean value

$$ \bar{z} = \frac{1}{2}\left[\ln\left(\frac{1+r_{\text{true}}}{1-r_{\text{true}}}\right) + \frac{r_{\text{true}}}{N-1}\right] \tag{14.5.7} $$

where $r_{\text{true}}$ is the actual or population value of the correlation coefficient, and with a standard deviation

$$ \sigma(z) \approx \frac{1}{\sqrt{N-3}} \tag{14.5.8} $$
Equations (14.5.7) and (14.5.8), when they are valid, give several useful statistical tests. For example, the significance level at which a measured value of $r$ differs from some hypothesized value $r_{\text{true}}$ is given by

$$ \operatorname{erfc}\left(\frac{|z - \bar{z}|\sqrt{N-3}}{\sqrt{2}}\right) \tag{14.5.9} $$

where $z$ and $\bar{z}$ are given by (14.5.6) and (14.5.7), with small values of (14.5.9) indicating a significant difference. (Setting $\bar{z} = 0$ makes expression 14.5.9 a more accurate replacement for expression 14.5.2 above.) Similarly, the significance of a difference between two measured correlation coefficients $r_1$ and $r_2$ is

$$ \operatorname{erfc}\left(\frac{|z_1 - z_2|}{\sqrt{2}\,\sqrt{\dfrac{1}{N_1-3} + \dfrac{1}{N_2-3}}}\right) \tag{14.5.10} $$

where $z_1$ and $z_2$ are obtained from $r_1$ and $r_2$ using (14.5.6), and where $N_1$ and $N_2$ are, respectively, the number of data points in the measurement of $r_1$ and $r_2$.

All of the significances above are two-sided. If you wish to disprove the null hypothesis in favor of a one-sided hypothesis, such as that $r_1 > r_2$ (where the sense of the inequality was decided a priori), then (i) if your measured $r_1$ and $r_2$ have the wrong sense, you have failed to demonstrate your one-sided hypothesis, but (ii) if they have the right ordering, you can multiply the significances given above by 0.5, which makes them more significant.
But keep in mind: These interpretations of the $r$ statistic can be completely meaningless if the joint probability distribution of your variables $x$ and $y$ is too different from a binormal distribution.

    #include <math.h>
    #define TINY 1.0e-20    /* Will regularize the unusual case of complete correlation. */

    /* Given two arrays x[1..n] and y[1..n], this routine computes their
       correlation coefficient r (returned as r), the significance level at
       which the null hypothesis of zero correlation is disproved (prob, whose
       small value indicates a significant correlation), and Fisher's z
       (returned as z), whose value can be used in further statistical tests
       as described above. */
    void pearsn(float x[], float y[], unsigned long n, float *r, float *prob,
        float *z)
    {
        float betai(float a, float b, float x);
        float erfcc(float x);
        unsigned long j;
        float yt,xt,t,df;
        float syy=0.0,sxy=0.0,sxx=0.0,ay=0.0,ax=0.0;

        for (j=1;j<=n;j++) {    /* Find the means. */
            ax += x[j];
            ay += y[j];
        }
        ax /= n;
        ay /= n;
        for (j=1;j<=n;j++) {    /* Compute the correlation coefficient. */
            xt=x[j]-ax;
            yt=y[j]-ay;
            sxx += xt*xt;
            syy += yt*yt;
            sxy += xt*yt;
        }
        *r=sxy/(sqrt(sxx*syy)+TINY);
        *z=0.5*log((1.0+(*r)+TINY)/(1.0-(*r)+TINY));    /* Fisher's z transformation. */
        df=n-2;
        t=(*r)*sqrt(df/((1.0-(*r)+TINY)*(1.0+(*r)+TINY)));    /* Equation (14.5.5). */
        *prob=betai(0.5*df,0.5,df/(df+t*t));    /* Student's t probability. */
        /* *prob=erfcc(fabs((*z)*sqrt(n-1.0))/1.4142136) */
        /* For large n, this easier computation of prob, using the short
           routine erfcc, would give approximately the same value. */
    }

14.6 Nonparametric or Rank Correlation

It is precisely the uncertainty in interpreting the significance of the linear correlation coefficient $r$ that leads us to the important concepts of nonparametric or rank correlation.
As before, we are given $N$ pairs of measurements $(x_i, y_i)$. Before, difficulties arose because we did not necessarily know the probability distribution function from which the $x_i$'s or $y_i$'s were drawn.

The key concept of nonparametric correlation is this: If we replace the value of each $x_i$ by the value of its rank among all the other $x_i$'s in the sample, that is, $1, 2, 3, \ldots, N$, then the resulting list of numbers will be drawn from a perfectly known distribution function, namely uniformly from the integers between 1 and $N$, inclusive. Better than uniformly, in fact, since if the $x_i$'s are all distinct, then each integer will occur precisely once. If some of the $x_i$'s have identical values, it is conventional to assign to all these "ties" the mean of the ranks that they would have had if their values had been slightly different. This midrank will sometimes be an integer, sometimes a half-integer. In all cases the sum of all assigned ranks will be the same as the sum of the integers from 1 to $N$, namely $\frac{1}{2}N(N+1)$.

Of course we do exactly the same procedure for the $y_i$'s, replacing each value by its rank among the other $y_i$'s in the sample.

Now we are free to invent statistics for detecting correlation between uniform sets of integers between 1 and $N$, keeping in mind the possibility of ties in the ranks. There is, of course, some loss of information in replacing the original numbers by ranks. We could construct some rather artificial examples where a correlation could be detected parametrically (e.g., in the linear correlation coefficient $r$), but could not