1.6 Data Representation Abstractions

A great deal of human activity centres on the collection and processing of data. Sight, smell, touch, taste, and hearing provide personal data input for most of us. We also collect data about the stock market, the economy, budgets, financial institutions, population statistics, chromatography experiments, seismological surveys, the weather, baseball and hockey teams, library loans, and a host of other events and activities. Any such data collection is likely to contain too many individual facts to be comprehended, as a whole--it must be processed in some fashion to become useful.

1.6.1 Data and Information

Raw data, as it is initially collected, is of little value or use. The human brain organizes audio/visual (and other) inputs and the mind interprets these inputs and assigns meaning to them. Thus a pattern of vibrations in the air is interpreted as conversation or as music, and a pattern of retinal impulses is a rose or perhaps a lover. One sees and hears, but there is much more to this than just a few organized sensory inputs, for intelligent organization is required to give meaning to the stimuli. Likewise, economic and scientific data consists only of raw numbers (symbols) until it has been organized and interpreted. Once such higher levels of meaning have been attached to data, it is termed information. This intelligent act of attaching meaning to raw data is clearly an abstraction process. Indeed, assigning meaning may to some extent be thought of as a synonym for the whole abstraction forming activity.

At a somewhat more concrete level, it is often possible to automate certain repetitious calculating tasks that are part of the process of placing meaning on data collections. These tasks are ideally suited to modern computing machinery and are driven by sets of instructions that achieve the mechanical aspects of the data organization. One can go farther than this and say that certain standard meanings are collected and tabulated, then automatically assigned to the items in the data collection by the computer program. The latter is then just a fast and reliable extension of human intelligence bent to the abstraction task

1.6.2 Encoding (Representing) Data

However, there are practical issues to solve at a lower level than describing what data processing is. These centre on how data is communicated. Whenever one writes a symbol like "4" or 'four," a potential for communication exists, based on the fact that these symbols encode a certain idea. By convention, everyone encodes the same idea with these symbols, so communication is possible.

Computation devices must also encode data in some consistent way. There are two categories of such codes:

1. External codes. These are usually human-readable characters that are input into a computer or output from it using a keyboard, screen, printer, or other device. The most common form for this data is a character such as "4," "a," "%," and so on.

2. Internal codes. Because computer storage is based on electronic circuits that can be thought of as ON/OFF switches, internal storage is in a different format than that used for human interface with the machine. In this form, it is not directly accessible by a human user. The person using a computing machine does not usually need to know what kind of internal representation is employed for data, because there are input/output programs that translate the data between internal and external formats.

There are some issues relating to data representation that do make a difference to programmers, however.

1.6.3 Atomic and Compound Entities

One of these issues is the level at which a data abstraction operates. Consider the symbol 4.25*10¹⁵ for example. On one level, this symbol can be thought of as having nine parts or components, one for each character used. On a second (higher) level, it consists of a mantissa (4.25) and an exponent (15) with the rest just punctuation. On a third level, one could take the symbol as an organic whole--a representation of a real number. This is what is usually done. Symbols like 27, -4.92, l6.3*10¹², and so on are normally thought of as atomic (indivisible) entities, and not in terms of the individual digits, signs, and decimal points.

There is data that is not like this. Consider

a set A= {1, 6, 9, 15}

an ordered pair P= (2,3)

or a complex number z = 4 - 6i.

Each of these is not only an entity in itself, but has also component parts that are sufficiently important to be referred to in their own right. One may use the individual elements of a set, the coordinates in an n-tuple, the parts of a complex number, and so on, in particular computations. In such cases, the structural parts of the data item become important in themselves. On the other hand, one may also refer to the entity as a whole, without making use of the knowledge of its constituent parts.

Similar situations arise when a data item has an assortment of related components of various kinds.

For instance, we might think of a student record as an aggregate (or a collective) consisting of:

name
address
phone number
marks
fees owed

Each constituent part of such a student record is a data item in itself, and so is the aggregate as a whole. Depending on what one is doing at the time (revising the marks, or storing the whole record) one might handle the data as an aggregate (a high level of abstraction) or by constituent parts (a low level of abstraction.) Indeed, in this case, a still higher abstraction may be employed if one wishes to deal with several such students as a group. Such a group could in turn be given a structure in a variety of ways:

a number could be painted on the back of each
they could be organized alphabetically by their names
they could be seated in rows and columns in a classroom

Data items that are normally handled as indivisible whole items are called atomic or unstructured and those that have component parts are called structured or compound.

1.6.4 The Concept of Type

It should be clear that data items like 4, 6, or 12 are of a different kind than those like "A," "Henry", or {5,9,16}. It is also the case, though perhaps not quite so obvious that 4,6,12 are also of a different kind than, say, -3 or 7.6.

The classification of data items into different kinds is done for one or more of three reasons:

1. the underlying concepts being abstracted by the symbols are different.

2. different kinds of data are stored by computing devices in different ways.

3. different operations may be defined on different kinds of data

These considerations are more evident in the case of atomic data on the one hand, and highly structured data on the other. However, mathematicians (and others) find it useful to distinguish among:

a. cardinals (also called unsigned whole numbers) 0,1,2,3,.......

(The strict mathematical convention is to use the term whole number only for what are called here unsigned whole numbers and not at all for the ones that might have negative values; the terminology here is looser, and conforms to the ISO standard vocabulary.)

b. integers (also called signed whole numbers)......-3,-2,-1, 0, 1,2,3,.......

c. reals numbers like 4.7, -3.98, 2.5 x 10⁷ , -6.3 x 10^-5

d. characters "A", "z", "2", " (whatever can be typed)

e. strings "Hi There" "Fred"

f. boolean false, true

g. various structures or collections of these basic kinds.

Note that one distinguishes a "4" (character) from a 4 (unsigned or signed whole number). One may often wish to distinguish a 4 from a 4.0 (real), or even between a 4 taken from the unsigned whole numbers and a 4 taken from the signed whole numbers. It may also be necessary to handle the character "4" differently from the one-character string "4".

Data that is of the numeric kinds (unsigned or signed whole number, and real) has certain operations defined for it. Indeed, without the operations of addition, subtraction, multiplication, division, and comparison, the numeric kinds would be incomplete. Anyone who uses such data assumes that the abstraction of numbers includes the ability to do such things.

On the other hand, these four operations mean little on, say, the character and boolean kinds. We could imagine other operations such as capitalize ("a") which produces "A" or possibly NOT (True) which produces False.

If the data is structured instead of atomic, the level at which an operation is applied on an item is also an important consideration. A group of students might have:

operations on the whole group

assign the group to a different room

operations on individuals as part of the group

add student to the group delete student from group re-position student within the group

operations on the particular data associated with an individual

change the last name of a student enter a mark for a student update the fees owing for a student

Thus, on any given level of the abstraction scale, the concept of the kind or type of data is incomplete without either implicitly or explicitly including the operations associated with the data at that level. Such considerations lead to the following definition:

A specified set of items with certain properties and operations in common is called an abstract data type, or ADT for short.

Some ADT's are easy to represent in a computer. BOOLEAN has only two possible values (True, False). These could be internally represented by, say, a zero and a one, respectively.

However, the numeric ADTs present one with various representation difficulties. As one conceives of them in mathematical terms, both unsigned whole numbers and signed whole numbers have an infinite number of possible values. To allow for that, the computer's storage would have to be unlimited. Since this is not possible, every actual system restricts these two types to some specific range of values. On smaller machines, this is often -32768.. 32767 for signed whole number and 0.. 65535 for unsigned whole number. On larger ones, the usable range could reach the millions or billions--but there is always a specific limit beyond which one cannot represent for the machine what are perfectly ordinary numbers. Users of a given system must simply know what this limit is, and be prepared to live within it.

The Real ADT presents another problem. Not only must there be maximum and minimum limits as for whole number ADTs, but there also restrictions on the available precision. No computer can make provision for storing an indefinite number of significant figures (decimal places); there is always some limit. If this limit is, say, eight figures, then 0.0000000001 cannot be distinguished from 0.0, nor can the completeness property of reals always be followed. (That is, that between any two reals there is another real.)

In the case of a data structure, different kinds of limitations may arise, for large numbers of similarly structured data items may not all fit in the memory of a computer at once, so a means must be found to store these on some external medium. It is then also necessary to have a method of reading from this external storage, altering some or all of the data, and writing it back again. Of course, if the collection becomes sufficiently large, even the external storage may be too small; it is then time to buy more disk or tape drives.

For such reasons, actual instances of ADTs on specific computing systems are always approximations to the intellectual concept one is trying to represent. This situation does not differ appreciably from the "real world" wherein measuring instruments also have limited capacity and/or precision. The users of any machine (whether a computer or a tape measure) must live with its limitations.

An instance of an ADT, complete with specified limits, is called an implementation of the ADT. One recognizes that such limitations vary from one computing platform to another by terming them implementation dependent.

It is also possible to distinguish three levels at which one could say that data exists:

as a 'pure' abstraction or idea
as a virtual implementation in some symbolic form (English description, mathematical or computing notation)
as an actual or physical implementation within a computing machine. (expressed, say, as voltages applied to certain electrical circuits)

1.6.5 Variables and Constants

In performing computations, it is often convenient to refer to data or to a portion of its structure by a named place holder. In algebraic terms, such names abstractly represent some value in a calculation.

One might write:

x = 5

y = 6

z = x + y

The first lines say to assign the name x to the value 5 and the name y to the value 6. The third line says to give the name z to the value obtained by adding the values represented by x and y.

Such named quantities fall into two broad categories--those understood to have a fixed value throughout the computation, and those whose value is subject to change during the course of the calculation. In writing: a formula like area = pi * radius^2 for instance, it is understood that the value of pi is fixed (approximately 3.14) whereas the values of the radius and area are subject to change depending on the particular instance of the problem. (See section 1.6)

Moreover, the type associated with the numbers when written literally (5, 6, 3.14) also attaches to the names used for them, so we say that these names too have a type. These considerations produce the following definitions:

A constant is a name for a value that is understood to be fixed.

A variable is a name whose value is subject to change during the course of a computation.

The type of a constant or a variable is the same as that of the abstract data type of which it names a particular instance.

This last definition may sound a little technical, but all it means in practice is that if the name pi is given the value 3.14159, then pi is of type real because its value is of type real.

Because such names are intended to abstract values, it is useful to employ descriptive names, so as to mirror function as part of doing the abstraction. Thus, one commonly employs names such as: interest, rate, time, speed, distance, area, perimeter, so on, rather than using a, x, i, P, s, a, b, and the like.

Contents