14.0 Introduction
14.1 Some statistical terms
14.2 The statistics program
14.3 Activities
Statistics is the mathematical treatment of sets of data. In this chapter we deal with the simple statistics of a set of measurements which might be expected to be randomly distributed around a steady or mean value. The measurements could be the weights or heights of a group of people, or the exam marks of a class.
The program in this chapter displays such sets of data. It also calculates and displays the mean of the measurements, their standard deviation and the standard error on the mean. Then it superimposes the shape of a normal (or Gaussian) distribution with the same mean and standard deviation, so that you can decide whether your data follow the normal distribution closely enough for your purpose.
Our statistics program calculates and prints the mean of a set of measurements, their standard deviation and the standard error on the mean. The meanings of these terms are as follows:
The mean of a set of n measurements x is given by the following formula, where Σ means 'sum of':

$$\bar{x} = \frac{\sum x}{n}$$
The standard deviation σ is given by the following formula:
The standard error on the mean is given by:
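In their usual textbook forms, the standard deviation σ and the standard error on the mean can be written as below, where x̄ is the mean of the n measurements x. The divisor n in the first formula is an assumption: it is the population form, and the sample form divides by n - 1 instead.

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}, \qquad \text{standard error on the mean} = \frac{\sigma}{\sqrt{n}}$$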
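The normal or Gaussian distribution has the following formula, where x̄ is the mean, σ is the standard deviation and K is a constant which sets the height of the curve:

$$y = \frac{K}{\sigma} \exp\left(-\frac{(x - \bar{x})^2}{2\sigma^2}\right)$$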
It can be recognised by its characteristic bell-shaped curve which is symmetrical about the mean.
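The three quantities defined in this section can be worked out in a few lines. What follows is a minimal sketch in Python, not the BASIC of Listing 14.1. The function name `summary` and the sample figures are made up for illustration, and the divisor n (rather than n - 1) in the standard deviation is an assumption.

```python
import math

def summary(data):
    """Return (mean, standard deviation, standard error on the mean)."""
    n = len(data)
    mean = sum(data) / n
    # Assumption: population form, dividing by n.
    # Divide by (n - 1) instead for the sample form.
    variance = sum((x - mean) ** 2 for x in data) / n
    sd = math.sqrt(variance)
    # The standard error on the mean shrinks as more data are collected.
    se = sd / math.sqrt(n)
    return mean, sd, se

# Illustrative figures only, not the wine measurements.
mean, sd, se = summary([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, sd, se)
```

Note that the standard deviation describes the spread of the individual measurements, whereas the standard error describes how well the mean itself is known.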
The statistics program is given in Listing 14.1. Screen Display 14.1 is typical of what it can do. The measurements are for the percentage alcohol in home-made wines. They were collected over a number of years from members of an evening class on wine making, and there are 56 measurements in all. All the members were making the same wine to the same recipe, so variations could be expected to be randomly distributed about a mean or a fixed value.
As Screen Display 14.1 shows, the distribution does not follow the normal (or Gaussian) distribution because it is not bell-shaped and symmetrical about the mean. Possibly some of the wines did not ferment to completion whereas the majority did. The majority should therefore be close to the maximum possible value while the others tail down to zero. The mean of the measurements, the standard deviation and the standard error on the mean are all printed on the screen.
The listing is a little longer than those that you will have come to expect. This is because the lines which work out the mean, the standard deviation and the standard error on the mean are in the program, rather than in a separate procedure.
The program stores the information in the form of DATA statements. This is important, for in such a program there may be very many items, and they must be in a form which can easily be checked and edited. For Screen Display 14.1, there are 56 items in the DATA statements, and such a large number is bound to need careful checking for accurate transcription.
Lines 110 to 160 calculate the sum of the data. Lines 210 to 260 then calculate the mean and the standard deviation, and line 690 calculates the standard error on the mean, all from the formulae given in Section 14.1. Line 290 dimensions the X and Y arrays and line 310 calls PROCscale, passing it just two points: the maximum and minimum data items. This sets up the scaling and allows lines 320 and 330 to make the bars of the histogram fit in neatly with the graduations along the axes. These lines ensure that there are at least 5 bars and not more than 15. If you do not like these limits, you can alter them.
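One way of fitting between 5 and 15 bars to the graduations is sketched below in Python. This is not the method of lines 320 and 330, which is not reproduced here. It assumes that bar widths are restricted to 1, 2 or 5 times a power of ten, and the function name `choose_bar_width` is hypothetical.

```python
import math

def choose_bar_width(lo, hi, min_bars=5, max_bars=15):
    """Return (width, first_edge, bars) for data lying between lo and hi.

    Assumption: widths are tried in the sequence 1, 2, 5 times a power
    of ten, smallest first, until the bar count falls within the limits.
    """
    span = hi - lo
    if span <= 0:
        raise ValueError("hi must be greater than lo")
    exponent = math.floor(math.log10(span / max_bars))
    for power in (exponent, exponent + 1, exponent + 2):
        for mult in (1, 2, 5):
            width = mult * 10.0 ** power
            # Start the first bar on a multiple of the width.
            first_edge = math.floor(lo / width) * width
            # Enough bars that hi falls inside the last one.
            bars = math.floor((hi - first_edge) / width) + 1
            if min_bars <= bars <= max_bars:
                return width, first_edge, bars
    raise ValueError("no suitable bar width found")

print(choose_bar_width(3.2, 14.8))
```

Changing `min_bars` and `max_bars` corresponds to altering the limits of 5 and 15 mentioned above.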
Lines 350 to 380 assign the X co-ordinates for the histogram bars. With the number of bars set up and the co-ordinates of the bars calculated, the next task is to find the frequencies with which the data items fit the bar categories. This is what lines 410 to 460 do.
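Counting how many data items fall into each bar can be sketched as follows. This is an illustrative Python version under stated assumptions, not a transcription of lines 410 to 460: the bars are taken to be of equal width, and the names `bar_frequencies`, `first_edge` and `width` are hypothetical.

```python
def bar_frequencies(data, first_edge, width, bars):
    """Count the data items falling into each of `bars` equal-width bars."""
    counts = [0] * bars
    for x in data:
        # Bar number, counting from 0 at the left-hand edge.
        index = int((x - first_edge) // width)
        # Assumption: items outside the range go into the end bars.
        index = min(max(index, 0), bars - 1)
        counts[index] += 1
    return counts

# Illustrative figures only.
print(bar_frequencies([3.2, 3.9, 4.0, 5.5, 14.8], 3.0, 1.0, 12))
```

The largest count found in this way is what the Y scaling has to accommodate.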
One further call to PROCscale at line 500 allows the scaling to be set up in the Y direction, that in the X direction remaining unchanged. A call to PROChisto draws up the bars and PROCaxes, PROCgraduate and PROCnumber put in the axes and scales.
Finally the normal distribution is drawn using the XOR method of plotting, set up by GCOL3,1 in line 560. Lines 580 to 600 then calculate 100 points along the normal distribution function and plot them using the DRAW statement.
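The calculation of the points along the curve can be sketched as below. This is a Python illustration rather than the BASIC of lines 580 to 600. The function name is hypothetical, and the constant `k`, which sets the height of the curve, is left at 1 as an assumption because its value in the listing is not given here.

```python
import math

def normal_curve_points(mean, sd, x_min, x_max, k=1.0, points=100):
    """Return `points` (x, y) pairs evenly spaced from x_min to x_max."""
    result = []
    for i in range(points):
        x = x_min + (x_max - x_min) * i / (points - 1)
        # Same form as the BASIC expression K*EXP(...)/SI.
        y = k * math.exp(-(x - mean) * (x - mean) / (2 * sd * sd)) / sd
        result.append((x, y))
    return result

curve = normal_curve_points(mean=5.0, sd=2.0, x_min=0.0, x_max=10.0)
print(len(curve), curve[0], curve[-1])
```

Joining successive points with straight lines, as DRAW does, gives a smooth-looking curve when the points are this closely spaced.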
i. Run the program of Listing 14.1 using your own data.
ii. Save the program, retrieve it and edit it in some way.
Screen Display 14.1
iii. You may like to compare your distribution with some distribution other than the normal (Gaussian) distribution, perhaps with the Poisson distribution or the Binomial distribution. You can easily do so by modifying lines 570 and 590 of Listing 14.1, which calculate and display the shape of the normal distribution with the same mean and standard deviation. The formula for the Gaussian distribution is given in Section 14.1. In BASIC the right-hand side becomes:
K*EXP(-(X-MEN)*(X-MEN)/(2*SI*SI))/SI
This is used in lines 570 and 590, and you can easily replace it with an expression for another distribution.
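As an example of a replacement expression, the Poisson distribution gives the expected number of items at each whole-number value k. The sketch below is in Python rather than BASIC, and the names `poisson_height` and `total` are hypothetical. It assumes that the data are counts (0, 1, 2, ...) and that the curve is scaled by the total number of items.

```python
import math

def poisson_height(k, mean, total):
    """Expected number of items equal to k for a Poisson distribution.

    k must be a non-negative whole number; `total` is the number of
    data items, used to scale the probabilities to frequencies.
    """
    return total * math.exp(-mean) * mean ** k / math.factorial(k)

# Illustrative figures only: mean 3, 56 items.
for k in range(8):
    print(k, round(poisson_height(k, 3.0, 56), 2))
```

Unlike the normal distribution, the Poisson distribution is not symmetrical about the mean, so it may suit data which tail off to one side.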