7.7 Some Statistical Tools

One of the most common applications of computing machinery is that of the analysis of data. It is often the case that a researcher is faced with large amounts of data describing the results of experiments. In order to make sense of this, certain standard tools are often applied, and in this section, two of the simplest are considered. The first of these is the *mean* or ordinary average:

The *mean* of a group of data is their sum divided by the number of items in the group.

Besides the mean itself, one often wants to know something about how closely the data points are clustered. For instance, the mean of the data

13,14,13,15,16,13,15,16,14,15,14,14,13,12,13

is 14 and so is the mean of the data

1,1,1,1,1,1,1,1,1,1,1,1,1,1,196

yet these two data sets are very different, and there has to be a way to measure that difference.

One way of expressing the scattering of data from the mean is given by the standard deviation. It is found by taking the difference between the mean and each data item, squaring this number and adding all these squares. One then divides by the number of items and takes the square root of the final result.

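In symbols, for data items $x_1, x_2, \ldots, x_n$:

$$(\text{standard deviation})^2 = \frac{(x_1 - M)^2 + (x_2 - M)^2 + \cdots + (x_n - M)^2}{n}$$

where *M* is the mean. That is, the standard deviation is the square root of the expression on the right.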

**NOTES**: The expression above is used only when the standard deviation is computed on the entire population (all possible data). If the statistics are gathered on a sample of the population, the denominator is changed to *n - 1*.

The quantity (standard deviation)² is called the *variance* of the data.
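As a check on the description above, the computation for the first data set can be worked by hand. In that set, 13 occurs five times, 14 four times, 15 three times, 16 twice, and 12 once. The symbols $x_i$ for the data items and $M$ for the mean are notation adopted here only for the illustration.

$$
\begin{aligned}
M &= \frac{210}{15} = 14 \\
\sum (x_i - M)^2 &= 5(13-14)^2 + 4(14-14)^2 + 3(15-14)^2 + 2(16-14)^2 + (12-14)^2 \\
&= 5 + 0 + 3 + 8 + 4 = 20 \\
\text{population variance} &= \frac{20}{15} \approx 1.33, \qquad \text{population standard deviation} \approx 1.15 \\
\text{sample variance} &= \frac{20}{14} \approx 1.43, \qquad \text{sample standard deviation} \approx 1.20
\end{aligned}
$$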

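To construct a satisfactory Modula-2 routine for computing standard deviation, note that if the numerator of the above expression is expanded (writing the data items as $x_1, x_2, \ldots, x_n$), one gets:

$$(x_1^2 - 2x_1M + M^2) + (x_2^2 - 2x_2M + M^2) + \cdots + (x_n^2 - 2x_nM + M^2)$$

or

$$(x_1^2 + x_2^2 + \cdots + x_n^2) - 2M(x_1 + x_2 + \cdots + x_n) + nM^2$$

Substituting the formula for the mean, namely

$$M = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

yields

$$(x_1^2 + x_2^2 + \cdots + x_n^2) - \frac{2(x_1 + x_2 + \cdots + x_n)^2}{n} + \frac{(x_1 + x_2 + \cdots + x_n)^2}{n}$$

or, after collecting terms and dividing by *n*,

$$(\text{standard deviation})^2 = \frac{(x_1^2 + x_2^2 + \cdots + x_n^2) - \dfrac{(x_1 + x_2 + \cdots + x_n)^2}{n}}{n}$$

In the case of a *sample* population, the *n* in the main denominator becomes *n - 1* and this formula is

$$(\text{standard deviation})^2 = \frac{(x_1^2 + x_2^2 + \cdots + x_n^2) - \dfrac{(x_1 + x_2 + \cdots + x_n)^2}{n}}{n - 1}$$

If a summation notation is employed, these two are written:

$$(\text{standard deviation})^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n} \qquad \text{(whole population)}$$

$$(\text{standard deviation})^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n - 1} \qquad \text{(sample)}$$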

Whichever of these (whole population or sample) is needed, this form is much easier to work with than the original definition, because the algorithm can operate by storing running totals of numbers as they are read or entered and also storing (separately) the running sum of their squares. The standard deviation can be determined at any point for the data entered thus far, or after all items have been entered, for the division by *n* or by *n - 1* can be done at any time.
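To see the running totals at work, consider the second data set from the start of this section (fourteen 1s and a single 196). Only three numbers need to be carried along: the count, the sum, and the sum of the squares. The arithmetic below is a hand check, with $x_i$ again standing for the data items.

$$
\begin{aligned}
n &= 15, \qquad \sum x_i = 14 \cdot 1 + 196 = 210, \qquad \sum x_i^2 = 14 \cdot 1 + 196^2 = 38430 \\
\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} &= 38430 - \frac{44100}{15} = 38430 - 2940 = 35490 \\
\text{population variance} &= \frac{35490}{15} = 2366, \qquad \text{population standard deviation} \approx 48.64 \\
\text{sample variance} &= \frac{35490}{14} = 2535, \qquad \text{sample standard deviation} \approx 50.35
\end{aligned}
$$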

One could even decide to save storage space by not storing all the data items in an array as they are examined, but instead keeping track only of the two running totals and the number of items. While this suggestion is not in fact adopted in the code that follows, it could be an important one if the number of data entries in some disk file were very large and the amount of available memory comparatively small.
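The storage-saving idea might be sketched as follows. This is only an illustration under stated assumptions, not the approach taken in the library code of this section: the module name `RunningStats` and its variable names are invented here, input is assumed to arrive one number per line on the standard input channel, and an empty line or anything that is not a number is taken to end the list.

```modula-2
MODULE RunningStats;
(* Sketch only: keeps the count, the sum, and the sum of squares of
   the numbers read, without storing the individual items.
   All names in this module are illustrative. *)

FROM SRealIO IMPORT ReadReal, WriteFixed;
FROM STextIO IMPORT WriteString, WriteLn, SkipLine;
FROM SIOResult IMPORT ReadResult, ReadResults;
FROM RealMath IMPORT sqrt;

VAR
  x, sum, sumSq, n, variance : REAL;
  count : CARDINAL;

BEGIN
  count := 0;
  sum := 0.0;
  sumSq := 0.0;
  WriteString ("Enter numbers, one per line; an empty line ends the list.");
  WriteLn;
  LOOP
    ReadReal (x);
    IF ReadResult () # allRight THEN
      EXIT    (* empty line, bad format, or end of input *)
    END;
    SkipLine; (* discard the rest of the line, including the line mark *)
    INC (count);
    sum := sum + x;
    sumSq := sumSq + x * x;
  END;
  IF count > 0 THEN
    n := FLOAT (count);
    (* population variance from the two running totals *)
    variance := (sumSq - sum * sum / n) / n;
    WriteString ("mean is ");
    WriteFixed (sum / n, 2, 0);
    WriteLn;
    WriteString ("population standard deviation is ");
    WriteFixed (sqrt (variance), 2, 0);
    WriteLn;
  ELSE
    WriteString ("no data entered");
    WriteLn;
  END;
END RunningStats.
```

However many items are entered, the program holds only three accumulating values, so its memory use does not grow with the size of the data.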

To obtain maximum flexibility, the statistical functions have been divided into two library modules, one a *low level* module whose only purpose is to accumulate the running number of items, sum, and sum of squares as data is fed to it, along with the largest and smallest items seen, and to report these values when desired. The second, higher level module drives the lower level one and does the computations of mean, variance, and standard deviation using results obtained from it.

```modula-2
DEFINITION MODULE LowStats;

(* Library of commonly used low level statistical functions
   design by R. Sutcliffe & portions of implementation by Mark Harder
   last revision 1993 04 06 *)

PROCEDURE Reset ();
(* Use this procedure before starting to call accumulating variables
   for a new calculation
   Pre: none
   Post: the number of items, sum, and sum of squares are all set to zero.
         Max is set to MIN(REAL) and min is set to MAX(REAL)
   NOTE: The initialization code of the module calls Reset. *)

PROCEDURE Enter (x : REAL);
(* Pre: If this is the first call of this procedure for a new set of data,
        Reset must be called first.
   Post: the number of items, running sum and running sum of squares
         are updated *)

PROCEDURE Size () : CARDINAL;
(* Pre: none
   Post: returns the number of items accumulated since the last call
         to Reset. *)

PROCEDURE Sum () : REAL;
(* Pre: none
   Post: returns the sum of the items accumulated since the last call
         to Reset. *)

PROCEDURE SumSquares () : REAL;
(* Pre: none
   Post: returns the sum of the squares of the items accumulated since
         the last call to Reset. *)

PROCEDURE Max () : REAL;
(* Pre: none
   Post: returns the largest of the items accumulated since the last
         call to Reset. *)

PROCEDURE Min () : REAL;
(* Pre: none
   Post: returns the smallest of the items accumulated since the last
         call to Reset. *)

END LowStats.
```

```modula-2
IMPLEMENTATION MODULE LowStats;

(* Library of commonly used low level statistical functions
   design by R. Sutcliffe & portions of implementation by Mark Harder
   last revision 1993 04 06 *)

VAR
  count : CARDINAL;
  sum, sumSq, max, min : REAL;

PROCEDURE Reset ();
BEGIN
  count := 0;
  sum := 0.0;
  sumSq := 0.0;
  max := MIN(REAL);
  min := MAX(REAL);
END Reset;

PROCEDURE Enter (x : REAL);
BEGIN
  INC(count);
  sum := sum + x;
  sumSq := sumSq + x * x;
  IF x > max THEN
    max := x
  END;
  IF x < min THEN
    min := x
  END;
END Enter;

PROCEDURE Size () : CARDINAL;
BEGIN
  RETURN count;
END Size;

PROCEDURE Sum () : REAL;
BEGIN
  RETURN sum;
END Sum;

PROCEDURE SumSquares () : REAL;
BEGIN
  RETURN sumSq;
END SumSquares;

PROCEDURE Max () : REAL;
BEGIN
  RETURN max;
END Max;

PROCEDURE Min () : REAL;
BEGIN
  RETURN min;
END Min;

BEGIN (* initialization code *)
  Reset;
END LowStats.
```

Observe that no error handling has been done. There could be problems if the real type is overflowed, for example. In the exercises, the reader is asked to remedy this oversight. In the higher level module that follows, there is some redundancy, for the procedures *Largest* and *Smallest* return results that could be obtained by calling the lower level module directly. However, there are other possible higher end modules that could be written here; not all of them might need the maximum and minimum items.

```modula-2
DEFINITION MODULE Stats;

(* Library of commonly used statistical functions
   design by R. Sutcliffe & portions of implementation by Mark Harder
   last revision 1993 04 06 *)

PROCEDURE EnterData (items : ARRAY OF REAL; numItems : CARDINAL);
(* Pre: the items in use are numbered 0 .. numItems - 1
   Post: data is ready for analysis with statistical functions below *)

PROCEDURE Largest () : REAL;
(* Pre: none
   Post: The highest value in the last array submitted with EnterData
         is returned *)

PROCEDURE Smallest () : REAL;
(* Pre: none
   Post: The lowest value in the last array submitted with EnterData
         is returned *)

PROCEDURE Mean () : REAL;
(* Pre: none
   Post: The mean of all the values in the last array submitted with
         EnterData is returned *)

PROCEDURE VariancePop () : REAL;
(* Pre: none
   Post: The population variance of all the values in the last array
         submitted with EnterData is returned *)

PROCEDURE VarianceSamp () : REAL;
(* Pre: none
   Post: The sample variance of all the values in the last array
         submitted with EnterData is returned *)

PROCEDURE StdDevPop () : REAL;
(* Pre: none
   Post: The population standard deviation of all the values in the
         last array submitted with EnterData is returned *)

PROCEDURE StdDevSamp () : REAL;
(* Pre: none
   Post: The sample standard deviation of all the values in the last
         array submitted with EnterData is returned *)

END Stats.
```

```modula-2
IMPLEMENTATION MODULE Stats;

(* Library of commonly used statistical functions
   design by R. Sutcliffe & portions of implementation by Mark Harder
   last revision 1993 04 06 *)

FROM LowStats IMPORT
  Reset, Enter, Size, Sum, SumSquares, Max, Min;
FROM RealMath IMPORT
  sqrt;

PROCEDURE EnterData (items : ARRAY OF REAL; numItems : CARDINAL);
VAR
  count : CARDINAL;
BEGIN
  Reset;
  FOR count := 0 TO numItems - 1 DO
    Enter (items [count]);
  END;
END EnterData;

PROCEDURE Largest () : REAL;
BEGIN
  RETURN Max();
END Largest;

PROCEDURE Smallest () : REAL;
BEGIN
  RETURN Min();
END Smallest;

PROCEDURE Mean () : REAL;
BEGIN
  RETURN Sum() / FLOAT(Size());
END Mean;

PROCEDURE VariancePop () : REAL;
VAR
  size : REAL;
BEGIN
  size := FLOAT( Size ());
  RETURN (SumSquares () - (( Sum() * Sum()) / size)) / size;
END VariancePop;

PROCEDURE VarianceSamp () : REAL;
VAR
  size : REAL;
BEGIN
  size := FLOAT( Size ());
  RETURN (SumSquares () - (( Sum() * Sum()) / size)) / (size - 1.0);
END VarianceSamp;

PROCEDURE StdDevPop () : REAL;
BEGIN
  RETURN sqrt (VariancePop ());
END StdDevPop;

PROCEDURE StdDevSamp () : REAL;
BEGIN
  RETURN sqrt (VarianceSamp ());
END StdDevSamp;

END Stats.
```

Notice how the work has been distributed so that most of the procedures have only a line or two of code. This makes the code easier to debug than it would be otherwise. In fact, when the test program below was compiled and run, the only errors found were in the client. The library modules needed no corrections after the initial compilation; all their functions worked correctly the first time.

```modula-2
MODULE TestStats;

(* by R. Sutcliffe
   to test the statistics modules
   last revision 1993 04 06 *)

FROM Stats IMPORT
  EnterData, Largest, Smallest, Mean, VariancePop, VarianceSamp,
  StdDevPop, StdDevSamp;
FROM SRealIO IMPORT
  ReadReal, WriteFixed;
FROM STextIO IMPORT
  WriteString, WriteLn;
FROM SWholeIO IMPORT
  WriteCard;
FROM RedirStdIO (* non-standard *) IMPORT
  OpenOutput, CloseOutput;

PROCEDURE WriteStats;
BEGIN
  WriteString ("largest is ");
  WriteFixed (Largest (), 2, 0);
  WriteLn;
  WriteString ("smallest is ");
  WriteFixed (Smallest (), 2, 0);
  WriteLn;
  WriteString ("mean is ");
  WriteFixed (Mean (), 2, 0);
  WriteLn;
  WriteString ("population variance is ");
  WriteFixed (VariancePop (), 2, 0);
  WriteLn;
  WriteString ("sample variance is ");
  WriteFixed (VarianceSamp (), 2, 0);
  WriteLn;
  WriteString ("population standard deviation is ");
  WriteFixed (StdDevPop (), 2, 0);
  WriteLn;
  WriteString ("sample standard deviation is ");
  WriteFixed (StdDevSamp (), 2, 0);
  WriteLn;
END WriteStats;

TYPE
  DataArray = ARRAY [0 .. 20] OF REAL;

VAR
  theData : DataArray;

BEGIN
  theData [0] := 13.0;
  theData [1] := 14.0;
  theData [2] := 13.0;
  theData [3] := 15.0;
  theData [4] := 16.0;
  theData [5] := 13.0;
  theData [6] := 15.0;
  theData [7] := 16.0;
  theData [8] := 14.0;
  theData [9] := 15.0;
  theData [10] := 14.0;
  theData [11] := 14.0;
  theData [12] := 13.0;
  theData [13] := 12.0;
  theData [14] := 13.0;
  OpenOutput;
  EnterData (theData, 15);
  WriteString (" First Run:");
  WriteLn;
  WriteStats;
  WriteLn;

  theData [0] := 1.0;
  theData [1] := 1.0;
  theData [2] := 1.0;
  theData [3] := 1.0;
  theData [4] := 1.0;
  theData [5] := 1.0;
  theData [6] := 1.0;
  theData [7] := 1.0;
  theData [8] := 1.0;
  theData [9] := 1.0;
  theData [10] := 1.0;
  theData [11] := 1.0;
  theData [12] := 1.0;
  theData [13] := 1.0;
  theData [14] := 196.0;
  EnterData (theData, 15);
  WriteString (" Second Run:");
  WriteLn;
  WriteStats;
  CloseOutput;
END TestStats.
```

As can be seen, the data chosen for the two runs are the very collections with which this discussion began. The results are given below:

```
First Run:
largest is 16.00
smallest is 12.00
mean is 14.00
population variance is 1.33
sample variance is 1.43
population standard deviation is 1.15
sample standard deviation is 1.20

Second Run:
largest is 196.00
smallest is 1.00
mean is 14.00
population variance is 2366.00
sample variance is 2535.00
population standard deviation is 48.64
sample standard deviation is 50.35
```
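Because no error handling has been done, a procedure such as *Mean* divides by the number of items, which is zero immediately after a reset. The following is a minimal sketch of one way a client might guard against that case. The module `GuardedMean` and the procedure `TryMean` are hypothetical names introduced for this illustration and are not part of the library above. The sketch also drives *LowStats* directly, without an array.

```modula-2
MODULE GuardedMean;
(* Sketch only: TryMean is a hypothetical helper, not part of the
   Stats library given in the text. *)

FROM LowStats IMPORT Reset, Enter, Size, Sum;
FROM SRealIO IMPORT WriteFixed;
FROM STextIO IMPORT WriteString, WriteLn;

PROCEDURE TryMean (VAR mean : REAL) : BOOLEAN;
(* Pre: none
   Post: if at least one item has been entered since the last Reset,
         mean holds the average and TRUE is returned; otherwise
         mean is unchanged and FALSE is returned. *)
BEGIN
  IF Size () = 0 THEN
    RETURN FALSE
  END;
  mean := Sum () / FLOAT (Size ());
  RETURN TRUE
END TryMean;

PROCEDURE Report;
VAR
  m : REAL;
BEGIN
  IF TryMean (m) THEN
    WriteString ("mean is ");
    WriteFixed (m, 2, 0);
  ELSE
    WriteString ("no data entered yet");
  END;
  WriteLn;
END Report;

BEGIN
  Reset;
  Report;          (* nothing entered: the guard reports this *)
  Enter (13.0);
  Enter (14.0);
  Enter (15.0);
  Report;          (* three items entered: the mean is written *)
END GuardedMean.
```

Returning a BOOLEAN and passing the result back through a VAR parameter is only one possible design; the exercises may call for a different one, such as an error flag kept inside the library module.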