Beginning Statistics with the TI-82/83

Statistics, in its simplest form, is a science that arranges many facts into an organized picture of the data.  When this arrangement is done, numbers that are ordered from smallest to largest are sometimes clustered in reasonable intervals, and some patterns can become apparent.

Statisticians may need to determine which number appears most often, what the average of all the numbers is, which number is in the middle, or how great the span is from the largest to the smallest number.  They make charts and plot the numbers in various ways.  They may also compare one set of numbers to another to discover the similarities and the differences in the sets of data.  Each number in the data is significant and represents something important.  A statistician determines the best way to organize the numbers to produce the necessary information for solving problems and making predictions.

We will begin by learning different ways of measuring univariate data, that is, data of one variable.  From there we will look at data with two or more variables and see how these variables are related to one another: linearly, quadratically, cubically, exponentially, and so on.  See the syllabus for an outline of what we will try to do.  We are not completely tied to the outline: if we need or want to slow down, we can and will.  If nonlinear relationships take us more time, we will spend more time there.

Most of what we will be doing this first week will be descriptive statistics.  We will want to understand how we can describe the data and how we might describe it in the best possible way.  We don’t need to be reminded of the quotation, often attributed to Mark Twain: “There are lies, damn lies, and statistics.”

Also, most of what we do this first week will be applicable in many situations, not only in Math and Science.  Once again we are trying to show that Math and Science apply to our everyday lives more often than we might expect.

If you think of an experiment that you would like for the group to do, or see, that ties in with the mathematics that we are doing, feel free to tell us about it.  We want to see as many of the areas of science as we can during these two weeks.

One of the items that I will give to you before we complete this course is a bibliography of books at the Secondary level that try to integrate Mathematics and Science.  There are several on the market and there are more experimental projects underway—one in Oregon and one at the University of British Columbia.

We use data to communicate information and support decisions.  If we want clear and concise communication, we need to have the data well-organized.  Only the smallest data sets will not have to be somehow summarized and boiled down into a simpler form.  Larger data sets are intrinsically unclear without some summarization.  Data is usually summarized in tables, and sometimes in graphs.  Most data in almanacs is summary data.


Institutions of Higher Education—Charges: 1970 to 1992

Source: National Center for Education Statistics, U.S. Dept. of Education.

 

Data are for the entire academic year ending in year shown. Figures for 1970 are average charges for full-time resident degree-credit students; figures for later years are average charges per full-time equivalent student. Room and board are based on full-time students.

 

                 Tuition and Required Fees                     Board Rates                         Dormitory Charges
Control             All        2-yr.        4-yr.          All        2-yr.        4-yr.          All        2-yr.        4-yr.
and year   institutions     colleges universities institutions     colleges universities institutions     colleges universities

Public:
1970               $323         $178         $427         $511         $465         $540         $369         $308         $395
1980                583          355          840          867          894          898          715          572          749
1990              1,356          756        2,035        1,635        1,581        1,728        1,513          962        1,561
1991              1,454          824        2,159        1,691        1,594        1,767        1,612        1,050        1,658
1992              1,624          937        2,410        1,780        1,612        1,852        1,731        1,074        1,789

Private:
1970              1,533        1,034        1,809          561          546          608          436          413          503
1980              3,130        2,062        3,811          955          924        1,078          827          769          999
1990              8,147        5,196       10,348        1,948        1,811        2,339        1,923        1,663        2,411
1991              8,772        5,570       11,379        2,074        1,989        2,470        2,063        1,744        2,654
1992              9,434        5,752       12,192        2,252        2,090        2,727        2,221        1,789        2,860

The World Almanac® and Book of Facts 1994 is licensed from Funk and Wagnalls Corporation. Copyright © 1993 by Funk and Wagnalls Corporation. All rights reserved.

The World Almanac and The World Almanac Book of Facts are registered trademarks of Funk and Wagnalls Corporation.

Here we have the information summarized for us.  We definitely don’t have this information for each public and private institution in the U.S. — nor would we necessarily want that type of information.  Consider the following Table from the 1987 Census of Agriculture:

Farms by size (1987)

Size of farm (acres)    Number of farms (thousands)    Percent of farms
Under 10                                        183                 8.8
10–49                                           412                19.8
50–99                                           311                14.9
100–179                                         334                16.0
180–259                                         192                 9.2
260–499                                         286                13.7
500–999                                         200                 9.6
1000–1999                                       102                 4.9
2000 and over                                    67                 3.2


Organizing Data — Line Plots

We will look at some apple data taken from Investigating Apples, Christine V. Johnson, Addison Wesley.  Since she comes from the state of Washington, she has quite a bit of data at hand. 

After being harvested, apples are sorted by size and packed in fiberboard boxes for shipment.  Each box contains 42 pounds of fruit, packed by count.  For example, a size 100 box has 100 apples of equal size for a combined weight of 42 pounds.  Other standard sizes range from 48 to 216.  Sizes 48 through 80 are considered large apples, 88 through 125 are medium apples, and 138 through 216 are small apples.  The following table gives approximate masses of different sizes of apples.

Scale of Size and Approximate Mass
Apples

Size    Mass in Grams        Size    Mass in Grams
  48         397              125         153
  56         340              138         136
  64         298              150         127
  72         264              163         116
  80         238              175         108
  88         215              198          96
 100         190              216          88
 113         167
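The masses in the table are roughly what we get by splitting the 42-pound box evenly among the apples in it. The short Python sketch below checks this. The conversion factor of 453.59237 grams per pound and the names `BOX_POUNDS`, `TABLE_MASSES`, and `approx_mass` are our own choices for illustration, not something taken from the book, and since the table values are only approximate, differences of a gram or two are expected.

```python
# Approximate mass of one apple, assuming a 42-pound box split evenly by count.
GRAMS_PER_POUND = 453.59237  # standard conversion factor
BOX_POUNDS = 42

# Size (apples per box) -> mass in grams, copied from the table above.
TABLE_MASSES = {
    48: 397, 56: 340, 64: 298, 72: 264, 80: 238, 88: 215, 100: 190, 113: 167,
    125: 153, 138: 136, 150: 127, 163: 116, 175: 108, 198: 96, 216: 88,
}

def approx_mass(count):
    """Grams per apple if `count` equal apples fill one 42-pound box."""
    return BOX_POUNDS * GRAMS_PER_POUND / count

for size, listed in TABLE_MASSES.items():
    computed = approx_mass(size)
    print(f"size {size:3d}: computed {computed:6.1f} g, table {listed} g")
```

For a size 100 box this gives about 190.5 grams, which agrees with the 190 grams listed.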

 

 

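We are given the following data about size-80 apples of three different varieties, 36 apples of each variety.  The values are masses in grams, and each variety fills two columns of the table.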

Red Delicious        Granny Smith         Rome Beauty
 204    238           227    220           187    203
 212    239           193    221           188    205
 215    239           214    223           188    206
 221    239           206    217           192    206
 221    240           224    206           192    207
 222    241           237    228           192    207
 223    241           209    205           193    210
 224    241           210    229           194    210
 225    242           228    229           196    217
 226    245           231    229           198    217
 226    247           214    230           198    219
 227    248           186    211           199    220
 231    253           215    235           200    224
 231    255           216    206           200    228
 233    257           212    239           200    228
 233    263           217    240           200    231
 234    264           219    241           201    236
 237    266           220    245           202    246

We will construct line plots for this data.  Line plots are quick and simple ways to organize data.  From a line plot it is easy to spot the largest and smallest values, outliers, clusters, and gaps in the data.  It gives a nice presentation of the distribution and shows us a technique for computing the median value.  The line plots for these data sets are constructed as follows:

Draw a horizontal line and put a scale of numbers on it that runs from the smallest to the largest values of the data.  In our examples above, the scale will run from 190 to 266 for the Red Delicious apples and from 180 to 256 for the other two varieties.  Then put an X above the line at the appropriate position for each data value in your list, stacking the X's when a value repeats.

You construct the line plots for the other two sets of data.

Some of the features that we see from a line plot that are not apparent in a list of numbers are:

·    Outliers: data values that are substantially larger or smaller than the other values.

·    Clusters: isolated groups of points.

·    Gaps: large spaces between data points.

It is easy to spot the largest and smallest values from your line plot.  This is not true of a list of numbers unless they are ordered.
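If you want to check a hand-drawn line plot, the construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `line_plot` and the text layout (one character per gram, a label every 10 grams) are our own assumptions, and the data are the Red Delicious masses from the table.

```python
from collections import Counter

# Masses in grams of the 36 Red Delicious apples from the table.
RED_DELICIOUS = [
    204, 212, 215, 221, 221, 222, 223, 224, 225, 226, 226, 227,
    231, 231, 233, 233, 234, 237, 238, 239, 239, 239, 240, 241,
    241, 241, 242, 245, 247, 248, 253, 255, 257, 263, 264, 266,
]

def line_plot(values, low, high):
    """Return a text line plot: one X per data value, stacked above the scale."""
    counts = Counter(values)
    tallest = max(counts.values())
    rows = []
    # Build the rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        row = "".join("X" if counts[v] >= level else " "
                      for v in range(low, high + 1))
        rows.append(row.rstrip())
    rows.append("-" * (high - low + 1))  # the horizontal line
    # Label every 10th value on the scale, starting at `low`.
    labels = "".join(f"{v:<10d}" for v in range(low, high + 1, 10))
    rows.append(labels.rstrip())
    return "\n".join(rows)

print(line_plot(RED_DELICIOUS, 190, 266))
```

Calling `line_plot` with the scale 180 to 256 and one of the other two data sets gives a plot to compare against your own.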


Organizing Data — Stem and Leaf Plots

We have several ways of displaying and interpreting sets of data.  A line plot tends to have a sizable spread that can make it difficult to recognize patterns, and the presentation may be less effective as a tool to aid in the interpretation of the data.  The stem-and-leaf plot[1] provides an alternative method and allows us to compare two sets of data.  Stem-and-leaf plots are easier to construct than histograms and bar graphs, but give us essentially the same type of information.

First, find the smallest value and the largest value.  The smallest value for any of the three varieties is 186 and the largest is 266 grams.  This means that we are going to use the numbers 18 through 26 as the stems.

Next, write the stems vertically with a line to the right.

18 |
19 |
20 |
21 |
22 |
23 |
24 |
25 |
26 |

Lastly, separate each data value into a stem and a leaf and put the leaves on the plot to the right of the stem.  For example, the first data value from the Red Delicious apples is 204.  The stem is 20 and the leaf is 4.  The second value has a stem of 21 and a leaf of 2.  Continuing in this way we get the following plot for the Red Delicious apples.

18 |
19 |
20 | 4
21 | 2 5
22 | 1 1 2 3 4 5 6 6 7
23 | 1 1 3 3 4 7 8 9 9 9
24 | 0 1 1 1 2 5 7 8
25 | 3 5 7
26 | 3 4 6

The numbers in the stems are the hundreds and tens places of each of the data values while the leaves are the numbers in the ones places of the data entries.
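The splitting of each value into a stem and a leaf can be sketched in Python using integer division and remainder by 10. The function names `stem_and_leaf` and `print_stem_plot` are our own, chosen for illustration, and the sketch assumes whole-number data like the apple masses.

```python
from collections import defaultdict

# Masses in grams of the 36 Red Delicious apples from the table.
RED_DELICIOUS = [
    204, 212, 215, 221, 221, 222, 223, 224, 225, 226, 226, 227,
    231, 231, 233, 233, 234, 237, 238, 239, 239, 239, 240, 241,
    241, 241, 242, 245, 247, 248, 253, 255, 257, 263, 264, 266,
]

def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    return dict(leaves)

def print_stem_plot(values, first_stem, last_stem):
    """Print one row per stem, including stems that have no leaves."""
    plot = stem_and_leaf(values)
    for stem in range(first_stem, last_stem + 1):
        row = " ".join(str(leaf) for leaf in plot.get(stem, []))
        print(f"{stem} | {row}".rstrip())

print_stem_plot(RED_DELICIOUS, 18, 26)
```

The stems 18 and 19 are printed with no leaves so that the plot can later be lined up against the other two varieties.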

The stem plot shows the shape of the data a little more clearly than the line plot.  This is because it is somewhat summarized.  Here we see a fairly symmetrical, bell-shaped distribution with the lows balancing the highs.

Stem plots also offer us the opportunity to directly compare two sets of  grouped data.  We draw the stem and put the leaves of one set of data on the left and the leaves of the second set of data on the right.  These are sometimes called back-to-back stem plots.

      Red Delicious |    | Granny Smith
                    | 18 | 6
                    | 19 | 3
                  4 | 20 | 5 6 6 6 9
                5 2 | 21 | 0 1 2 4 4 5 6 7 7 9
  7 6 6 5 4 3 2 1 1 | 22 | 0 0 1 3 4 7 8 8 9 9 9
9 9 9 8 7 4 3 3 1 1 | 23 | 0 1 5 7 9
    8 7 5 2 1 1 1 0 | 24 | 0 1 5
              7 5 3 | 25 |
              6 4 3 | 26 |
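A back-to-back stem plot can be sketched the same way in Python. The only new idea is that the leaves on the left are written in reverse, so that they grow outward from the stem. The names `leaves_by_stem` and `back_to_back` are our own, and the layout with `|` separators is just one reasonable choice.

```python
# Masses in grams of the Red Delicious and Granny Smith apples from the table.
RED_DELICIOUS = [
    204, 212, 215, 221, 221, 222, 223, 224, 225, 226, 226, 227,
    231, 231, 233, 233, 234, 237, 238, 239, 239, 239, 240, 241,
    241, 241, 242, 245, 247, 248, 253, 255, 257, 263, 264, 266,
]
GRANNY_SMITH = [
    227, 193, 214, 206, 224, 237, 209, 210, 228, 231, 214, 186,
    215, 216, 212, 217, 219, 220, 220, 221, 223, 217, 206, 228,
    205, 229, 229, 229, 230, 211, 235, 206, 239, 240, 241, 245,
]

def leaves_by_stem(values):
    """Map each stem (value // 10) to its sorted leaves, stored as strings."""
    result = {}
    for v in sorted(values):
        result.setdefault(v // 10, []).append(str(v % 10))
    return result

def back_to_back(left, right, first_stem, last_stem):
    """Return the rows of a back-to-back stem plot as a list of strings."""
    left_leaves = leaves_by_stem(left)
    right_leaves = leaves_by_stem(right)
    # Left-hand leaves read outward from the stem, so reverse each list.
    left_rows = {s: " ".join(reversed(ls)) for s, ls in left_leaves.items()}
    width = max(len(row) for row in left_rows.values())
    rows = []
    for stem in range(first_stem, last_stem + 1):
        left_row = left_rows.get(stem, "")
        right_row = " ".join(right_leaves.get(stem, []))
        rows.append(f"{left_row:>{width}} | {stem} | {right_row}".rstrip())
    return rows

for row in back_to_back(RED_DELICIOUS, GRANNY_SMITH, 18, 26):
    print(row)
```

Replacing `GRANNY_SMITH` with the Rome Beauty masses gives the remaining comparison.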

 

We can see that both sets have the same basic shape, but “peak” at different places.  Two of the basic aspects that we can describe numerically about a set of data are the center of the distribution and the spread of the distribution.  There are two different ways to measure the location and the spread:

·    the median with the range;

·    the mean with the standard deviation.

Determining the median and the range is much simpler, requiring only counting and  understanding the fractions ¼, ½, and ¾.  We will work with both.


Median, Mean, Quartiles and Outliers

The average of a set of numbers is called the mean.  More precisely, it is called the arithmetic mean.  If our data is $x_1, x_2, \ldots, x_n$, then the mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

To find the median, first order and count the data.  If there is an odd number of data points, then the median of the data is the middle data point.  In terms of variables, if our ordered data set is $x_1 \le x_2 \le \cdots \le x_n$ with $n$ odd, then the median is the point $x_{(n+1)/2}$.  If there is an even number of data points, then the median is the average of the middle two values: $\frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right)$.  This makes half of the data larger than the median and half of the data less than the median.

The mean mass of the Red Delicious apples is 236.19 grams and the median mass is 237.5 grams.

To find the range, subtract the smallest number from the largest.  The range for the Red Delicious apples is 266-204 = 62 grams. 
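These three values are easy to confirm with a short Python sketch using the standard `statistics` module. The variable names are our own; on the TI-82/83 the same numbers would come from the calculator's one-variable statistics.

```python
from statistics import mean, median

# Masses in grams of the 36 Red Delicious apples from the table.
RED_DELICIOUS = [
    204, 212, 215, 221, 221, 222, 223, 224, 225, 226, 226, 227,
    231, 231, 233, 233, 234, 237, 238, 239, 239, 239, 240, 241,
    241, 241, 242, 245, 247, 248, 253, 255, 257, 263, 264, 266,
]

average = mean(RED_DELICIOUS)      # sum of the values divided by 36
middle = median(RED_DELICIOUS)     # average of the 18th and 19th ordered values
spread = max(RED_DELICIOUS) - min(RED_DELICIOUS)  # largest minus smallest

print(f"mean   = {average:.2f} g")  # 236.19
print(f"median = {middle} g")       # 237.5
print(f"range  = {spread} g")       # 62
```

Because there are 36 values, an even number, the median is the average of the two middle values, 237 and 238.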

Arrange the values of  your data in order from smallest to largest.

204 212 215 221 221 222 223 224 225 226 226 227 231 231 233 233 234 237 238 239 239 239 240 241 241 241 242 245 247 248 253 255 257 263 264 266

Find the median and draw a line through it.

204 212 215 221 221 222 223 224 225 226 226 227 231 231 233 233 234 237 |
238 239 239 239 240 241 241 241 242 245 247 248 253 255 257 263 264 266      

Now consider only the data that is less than the median.  There are 18 values to the left of the median.  The lower quartile, or first quartile, is the median of these values: (225 + 226)/2 = 225.5 grams.

Finally, consider the data that is greater than the median.  The median of these values is the upper quartile, or third quartile: (242 + 245)/2 = 243.5 grams.

The interquartile range (IQR) is the difference between the upper and lower quartiles.  In our example, the interquartile range is 243.5 - 225.5 = 18 grams.

The lower extreme is the smallest value in the data and the upper extreme is the largest value in the data.  Basically, the quartiles and the median divide the data up into 4 equal sets.
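The quartile procedure above, taking the median of each half of the ordered data, can be sketched in Python as follows. The function name `five_number_summary` is our own. For an odd number of data points this sketch leaves the middle value out of both halves, which is an assumption on our part, since the example here only covers the even case. Other software may use a different quartile rule (the default method of Python's `statistics.quantiles` is one such case) and can give slightly different quartiles.

```python
from statistics import median

# Masses in grams of the 36 Red Delicious apples from the table.
RED_DELICIOUS = [
    204, 212, 215, 221, 221, 222, 223, 224, 225, 226, 226, 227,
    231, 231, 233, 233, 234, 237, 238, 239, 239, 239, 240, 241,
    241, 241, 242, 245, 247, 248, 253, 255, 257, 263, 264, 266,
]

def five_number_summary(values):
    """Return (lower extreme, lower quartile, median, upper quartile, upper extreme)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values to the left of the median
    upper = data[half + n % 2:]  # values to the right (middle value skipped if n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

low, q1, med, q3, high = five_number_summary(RED_DELICIOUS)
iqr = q3 - q1
print(low, q1, med, q3, high)  # 204 225.5 237.5 243.5 266
print("IQR =", iqr)            # IQR = 18.0
```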

We mentioned an outlier earlier and indicated that it is a value that is widely separated from the rest of the data.  How far separated does a value need to be from the others before we are willing to call it an outlier?  By convention, most authors will define an outlier to be any number that is more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.  There is no particular reason we could not have multiplied the IQR by 2 or 1.75 or any other reasonable number.  It is just a generally accepted convention that we multiply the IQR by 1.5 to find any outliers.  With a lower quartile of 225.5 and an upper quartile of 243.5, our IQR is 18, so any outlier would have to have a mass of more than 243.5 + 1.5(18) = 270.5 grams or a mass of less than 225.5 - 1.5(18) = 198.5 grams.  We don’t have any outliers here, using this multiplier.
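The outlier check can be sketched in Python with the multiplier left as a parameter, since 1.5 is only a convention. The function name `outlier_fences` is our own, and the quartiles 225.5 and 243.5 are the ones found for the Red Delicious apples.

```python
# Masses in grams of the 36 Red Delicious apples from the table.
RED_DELICIOUS = [
    204, 212, 215, 221, 221, 222, 223, 224, 225, 226, 226, 227,
    231, 231, 233, 233, 234, 237, 238, 239, 239, 239, 240, 241,
    241, 241, 242, 245, 247, 248, 253, 255, 257, 263, 264, 266,
]

def outlier_fences(q1, q3, multiplier=1.5):
    """Return the (lower, upper) cutoffs; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

lower_fence, upper_fence = outlier_fences(225.5, 243.5)
outliers = [x for x in RED_DELICIOUS if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence)  # 198.5 270.5
print(outliers)                  # [] : no outliers with this multiplier
```

Passing a different `multiplier`, such as 1.0, shows how the choice of convention changes which values get flagged.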


Percentage of Sugar in Cereals[2]

Product                           % Sugar    Product                   % Sugar
Sugar Smacks (K)                     56.0    Kellogg Raisin Bran (A)      29.0
Apple Jacks (K)                      54.6    C.W. Post, Raisin, (A)       29.0
Froot Loops (K)                      48.0    C.W. Post, (A)               28.7
General Foods Raisin Bran (A)        48.0    Frosted Mini Wheats (K)      26.0
Sugar Corn Pops (K)                  46.0    Country Crisp (K)            22.0
Super Sugar Crisp (K)                46.0    Life, cinnamon (K)           21.0
Crazy Cow, chocolate (K)             45.6    100% Bran (A)                21.0
Corny Snaps (K)                      45.5    All Bran (A)                 19.0
Frosted Rice Krinkles (K)            44.0    Fortified Oat Flakes (A)     18.5
Frankenberry (K)                     43.7    Life (A)                     16.0
Cookie Crisp, vanilla (K)            43.5    Team (A)                     14.1
Cap’n Crunch, Crunchberries (K)      43.3    40% Bran (A)                 13.0
Cocoa Krispies (K)                   43.0    Grape Nuts Flakes (A)        13.3
Cocoa Pebbles (K)                    42.6    Buckwheat (A)                12.2
Fruity Pebbles (K)                   42.5    Product 19 (A)                9.9
Lucky Charms (K)                     42.2    Concentrate (A)               9.3
Cookie Crisp, chocolate (K)          41.0    Total (A)                     8.3
Sugar Frosted Flakes of Corn (K)     41.0    Wheaties (A)                  8.2
Quisp (K)                            40.7    Rice Krispies (K)             7.8
Crazy Cow, strawberry (K)            40.1    Grape Nuts (A)                7.0
Cookie Crisp, oatmeal (K)            40.1    Special K (A)                 5.4
Cap’n Crunch (K)                     40.0    Corn Flakes (A)               5.3
Count Chocula (K)                    39.5    Post Toasties (A)             5.0
Alpha Bits (K)                       38.0    Kix (K)                       4.8
Honey Comb (K)                       37.2    Rice Chex (A)                 4.4
Frosted Rice (K)                     37.0    Corn Chex (A)                 4.0
Trix (K)                             35.9    Wheat Chex (A)                3.5
Cocoa Puffs (K)                      33.3    Cheerios (A)                  3.0
Cap’n Crunch, peanut butter (K)      32.2    Shredded Wheat (A)            0.6
Golden Grahams (A)                   30.0    Puffed Wheat (A)              0.5
Cracklin’ Bran (A)                   29.0    Puffed Rice (A)               0.1

 

The K stands for kids’ cereal and the A for adult cereal.  We are interested in some of the different ways that we can describe this data.  Draw a stem-and-leaf plot for the adult cereals and for the kids’ cereals.  Find the mean, median, quartiles, range, and interquartile range.  Do we have any outliers?



[1] Sometimes called a stem plot.

[2] Source: United States Department of Agriculture, 1979.  Taken from Exploring Data, James Landwehr and Ann Watkins, Dale Seymour Publications, 1986.