This page was created and is maintained by Chad E Brown.
Herbert A. Simon and Craig A. Kaplan, "Chapter 1: Foundations of Cognitive Science," in M. I. Posner (Ed.) Foundations of Cognitive Science. MIT Press (1989).
This chapter explores the scope and dimensions of cognitive science, the study of intelligence and intelligent systems (especially computational aspects of such systems). The "cognitive disciplines" are each described in relation to cognitive science. These include psychology, artificial intelligence, linguistics, philosophy, neuroscience, as well as (to a lesser degree) anthropology, economics, and statistical decision theory.
The architecture of the human system is postulated to exist in levels, namely the neural level of interest to connectionists and the symbol level where information processing occurs. The standard psychological model of LTM (with relational "semantic" memory and an "index") and STM (as the activated portion of LTM) is described. Schemas (giving relational, declarative information) and productions (giving procedural information) are introduced. Control structures decide which productions to fire (if there is a conflict). The control structure may be as simple as ordering the production, or may involve goal symbols in the conditions of the production. For example, SOAR creates goal symbols to resolve such an "impasse." Network models (which may be parallel) are used by connectionists. Symbol models are used by those modeling the symbol level. Some symbolic models are network models incorporating parallelism, but most symbol models operate serially.
The difference between reasoning and search is discussed, including the questions of the uniqueness of language (regarding Chomsky's hypothesized language acquisition mechanism), inference rules vs. operators (for modeling thinking), and propositions vs. images (for knowledge representation). The authors determine these issues are a matter of degree rather than one or the other.
Then, the authors discuss the methodology of cognitive science, especially techniques of gathering and interpreting data. Protocol analysis is discussed at length (gathering the data from the subject, segmenting the data, encoding the data, etc.). The hope with protocol analysis is to have the subject output the contents of (auditory) STM as they solve a problem, which should give access to the knowledge heeded to solve the problem, and to interpret this output in such a way as to infer the underlying processes. Reliability and the use of interactive computer tools is also discussed. Also, using content analysis (of articles, texts, and notebooks) as experimental data is discussed (for example, to infer how a discovery was made). AI and computer simulation is discussed giving the examples of programs including: STUDENT (which solve algebra word problems) and READER (which reads text and builds a representation of the information in the text which can be used to answer questions and build summaries).
Meta-analysis of a variety of experiments and simulation reported in the literature is discussed.
The contribution of philosophy and logic is discussed, including the contribution and limitations of symbolic logic; the necessity of defining terms such as "thinking," "understanding," and "intuition"; the Turing Machine as a precise model of computation; the implications of Godel's incompleteness theorems (as applied to humans as well as computers); reasoning under uncertain conditions; and modal and nonmonotonic logics.
The themes underlying this discussion of methodology are:
1. It is important to have a "dense set of data points at the level of interest."
2. "The boundary conditions of a methodology must be established before the full potential of the method can be realized." (Compare protocol analysis to its forerunner, introspection.)
3. Interaction and combination of methodologies can create powerful new approaches.
Finally, the authors discuss the search for the invariant laws of cognition. The idea is that in any empirical science the task is to "discover and verify invariants in the phenomena under study." Since cognitive science studies intelligent systems capable of adapting to an external environment, this task is quite complicated. Different invariants should be expected in different time scales. In the case of an adaptive system, these invariants should be sought at a higher level, and the invariants should restrict the adaptive processes. On the one hand, the external "outer" environment provides problem spaces that require heuristic search (so the adaptation would involve learning heuristics that limit search in an immense problem space). On the other hand, the "inner" environment has limitations such as serial processing and the capacity of STM (or, registers in a computer). Evidence shows that experts rely on an extensive knowledge-base (gathered, for example, by a human in about 10 years of study). Several thousand production rules may be necessary for an expert system. With such an extensive knowledge base, it is clear how recognition could seem mysterious and seem to involve "intuition." So, invariants involving the existence and the rate of acquisition of such a knowledge base apparently exist. Other invariants occur in the brain near the sensory organs. These have been studied at the neurological level.
Finally, learning is discussed briefly. We must accept that the learning level is probably not invariant--as we can "learn to learn."
Craig A. Kaplan and Herbert A. Simon, "In Search of Insight," Cognitive Psychology 22, 374-419 (1990).
This paper uses the example of the Mutilated Checkerboard (MC) problem to demonstrate how insight (or, the AHA! experience) can correspond to a change in representation of a problem. The basic idea is that a problem solver first works in a space which is suggested by the instructions and after failing to solve the problem, the subject begins a meta-level search for a new representation (this also takes place in a problem space). This space of representations is not a well-understood search space with a simple set of generators, and so part of the purpose of the study was to discover what heuristics subjects used to guide their search for a new representation. One of the most interesting heuristics is to notice "invariants."
The experiment performed on human subjects was to have the subjects solve the MC problem within an hour or less. Different groups of subjects were given different styles of boards, each affected the salience of the parity of the squares. (One had blank squares, another had black and pink colored squares, another used the words "black" and "pink", and the fourth used the words "bread" and "butter.") The subjects were given hints (if hints were needed) to ensure that the problem was solved within the hour.
Constraints for the meta-level search which were examined by this study included: problem features, hints, relevant domain knowledge, and heuristics (such as attending to invariants).
It is argued that the difficulty in the MC problem lies mainly in the search for a covering in the original space, then in the search for a new representation, but not primarily in the search for a solution in the parity space. Once the notion of parity has been attended to, the solution is found fairly quickly.
The results of the experiment are discussed. This includes the results that those subjects whose checkerboard made parity less salient generally required the parity hint. The effect of cue salience and hints on the results is examined. The fact that knowledge can actually lead a problem solver down dead ends, wasting time, instead of helping solve the problem faster is discussed. In particular, it is noted that "noticing parity usually leads rapidly to a solution when the goal is to prove a covering impossible, but it seems to contribute not at all when the goal is to find a covering." Also, three general heuristics for searching in the meta-space are discussed briefly: Noticing Invariants, Forming Hypotheses, and Comparing Alternative Board Situations.
The idea of noticing invariants reminds me of dependence and independence of concepts on variables, but in a way I have not made precise. (Idea: We never say something is "invariant" but instead say it is "invariant with respect to . . .".)
Also, fast subjects and slow subjects are compared with respect to how many and what types of invariants they notice during comparable time frames. For example, the faster subjects noticed more perceptual invariants (but, of course, in the MC problem, the perceptual property of parity--in a case in which the alternating squares are labeled or colored--is the key to noticing useful invariants). It is also noted that faster problem solvers notice more varied types of invariants (they "think flexibly").
Howard B. Richman, James J. Staszewski, and Herbert A. Simon, "Simulation of Expert Memory Using EPAM IV." Psychological Review, 102:2 (1995) 305-330.
This paper examines the case of a subject DD who was trained over a 3 year period to memorize long lists of digits (up to over 100) presented at a rate of 1 digit/sec. It compares this human subject with the performance of EPAM IV which also learned to perform this "expert memorization" task. In order to perform this task, EPAM IV (and--the evidence suggest--the subject DD) builds a retrieval structure (a tree structure of nodes) in LTM which is used to store the digits and semantic information about groups of digits (in particular, DD associated them with running times, ages, dates, etc.--EPAM IV was programmed with this semantic information in LTM when the experiment was started). Information about the digits is stored both in the retrieval structure and in semantic memory associated with the semantic information mentioned above. Also, patterns were encoded and associated with groups of digits (such as "back to back" or "palindrome" style patterns). This redundancy allows EPAM (& DD) to recover missing information, leading to a much greater perfect recall rate than could otherwise be obtained.
The retrieval structure allows a subject to store digits which are presented at such a rapid rate because the discrimination net "index" does not have to be modified for each digit. Only the proper associations must be made. The learning curves for DD and two runs with EPAM are compared and are very similar. DD hit a plateau at around 80 digits, and there are some hypotheses explaining why this might have happened (such as the need for a retrieval structure with a higher branching rate at the root). Evidence suggests that DD forgot approximately 25% of the information placed in the retrieval structure (recovering the necessary information via redundancy). EPAM was set to use this same forgetting rate (and it could be argued that this parameter was used to make the learning curves fit, though this is denied in the article).
The actual task is divided into stages: Preparation (in which the subject/EPAM is told how long the list will be and the subject/EPAM recalls the appropriate retrieval structure), Study (when the digits are given at a rate of 1 per second), Rehearsal (in which the subject moves backwards through the groups, and EPAM is programmed to study the last few digits more intensely--apparently building new chunks, then progresses backwards through the list using a depth-first search and filling in some missing information), and Recall (in which DD outputs the digits in order, and EPAM is programmed to output the digits at a rate of 200 ms per digit using the retrieval structure, pattern information, and semantic categories (only searching the "active" part of the discrimination net to search for semantic categories)). The reported pause latencies of DD during recall generally agree with his (reported) retrieval structure, except with a slight modification of the retrieval structure for over 100 digits.
Further tests were done to study the effect of lists with many patterns vs. those with patterns removed and the effect of trial order (and the growing "activated" area of the discrimination net for EPAM). Also, the subject and EPAM were tested on memory scanning tasks (in which a group of digits from the list is given [visually for DD] and the subject/EPAM is asked for the next or previous group) and a free-recall task (in which the subject/EPAM is asked to recall all digits used in many practice sessions--these were reported by DD in the order of his organization by semantic categories). The conclusions are that EPAM simulates DD's learning, simulates DD's use of patterns, simulates interference ("proactive interference") via expanding regions of activation in the discrimination net, simulates DD's memory scanning ability, simulates pause times during recall, and the times for various operations.
Robert S. Siegler, "Chapter 8: Problem Solving," in Children's Thinking (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall, 1991.
This chapter is concerned with the problem solving abilities of children of various ages (occasionally compared with adult problem solving). A common theme is that younger children are better problem solvers than previously thought and older children are worse problem solvers than previously thought.
The aspects of problem solving examined include encoding, mental models, domain-general and domain-specific knowledge, and developmental differences. He begins with a problem which is easy to understand yet challenging for children and adults. A balance scale is held in place while weights are placed on the balance at different distances from the fulcrum. The subject is asked what will happen if the balance is released (balance/tip right/tip left). The author hypothesized that children would use one of four rules to make the prediction: the first uses only weight, the second uses weight first and relies on distance if the weights are equal, the third uses weights and distances but is unable to resolve conflicts between the prediction of the weights and distances, and the fourth can resolve such conflicts by calculating the "torque." Experiments confirmed that one of these rules was used by each subject. One interesting point was that 16- and 17-year olds who had learned about this problem in two classes (but with a slightly different representation) did not use Rule IV because they were unable to generalize their knowledge and apply it to the unfamiliar representation.
Some 5- and 8-year olds who used Rule I were shown that their predictions in the case of conflict (with distance dominating). The 5-year olds typically were unable to explain why, apparently because they were not encoding the distances. The 8-year olds typically were encoding the distances and used them to induce Rule II (or even Rule III). Once the 5-year olds were instructed to encode distance as well as weight, they often were able to induce Rule II.
A similar task involving prediction of the shadows of T-shapes of various sizes and distances from light sources is discussed, demonstrating the generality of the balance task.
The next part of the chapter discusses the specific problem-solving processes of planning, causal inference, analogy, tool use, and logical deduction.
Planning: Planning is different than problem solving as planning emphasizes future action and may only form an incomplete plan (leaving details to be filled in later). Two examples of planning are means-ends analysis (e.g., Tower of Hanoi) and route planning. This section inspired me to create(?) the quote: "He who hesitates is planning."
Causal Inference: This section is very interesting. It describes British philosopher David Hume's "three features that could lead people to infer that two events are causally related." These are contiguity (close in time and space), precedence (cause precedes effect), and covariation (occurred together regularly in the past). Another feature is a generative mechanism that implies the cause and effect relationship. People tend to use a generative mechanism to infer a causal relationship. Failing that, they use contiguity. Failing these, they use precedence and covariation. This section also discusses the differences between the reasoning of children (and adults) and scientific reasoning. In particular, the children have trouble setting up an experiment from which they can draw conclusions, and even when they do set up a valid experiment they often ignore conclusions which are inconsistent with their a priori beliefs.
Analogy: The ability to recognize and use analogies between problems is discussed. Sometimes cues are required to recognize the analogy. Sometimes the fact that an analogy exists must be made explicit. This varies with age.
Tool Use: Young children are capable of using tools such as rakes or canes to pull a toy towards them. Also, indirect tools such as models (e.g., a model of a room with a toy in it) can be used (e.g., to find the toy in the life-size room) by children at least about 3 years old. Conceptual tools such as measuring and counting can be used. However, tools can also lead to mistakes--such as thinking 3 chunks of food is the same amount as 3 chunks of food, disregarding the size of each chunk.
Logical Deduction: This section concentrates on transitive inference (especially linear orderings which occur in the real world, such as "taller than" or "larger than" or "longer than") and hierarchical organization (such as fruit-->grape-->red seedless grape). The trouble children have with representing and retaining ordering information is discussed. With respect to hierarchical organization and class inclusion, it is noted that different responses can result from using class nouns (such as trees, which can refer to many trees or only 2 trees) or collection nouns (such as forests, which cannot refer to 2 trees). Finally, it is noted that children do not trust logical necessity as much as empirical evidence. "When an experimenter said, 'Either the chip I am holding in my hand is blue or it is not blue,' 7-year-olds insisted that the experimenter open her hand before they would accept the statement as correct." This distrust of logic leads some children to decide a particular finite subset of a particular finite set is smaller than the set by counting the elements of each rather than deducing that all subsets are necessarily smaller than any set which contains them.
K. Anders Ericsson and Herbert A. Simon, "Verbal Reports as Data," Psychological Review 87:3 (1980), 215-250.
This paper argues that verbal reports can be used as data. Assuming the appropriate instructions are given (and assuming the basic information-processing model), a subject performing a task can verbalize (some of) the contents of STM without affecting the cognitive processes used to perform the task (except with a decrease in speed, especially if some of the data to be verbalized is not in auditory form). Many criticisms of verbal reports stem from justified criticisms of introspection. However, these criticisms in the end make clear a need for proper a methodology for verbal reporting. With a proper methodology, verbal reporting may reliably give the experimenter access to the information heeded by the subject while performing the task.
Verbal reporting is classified according to the following dimensions:
Relative Primacy -- Verbalization unrelated to task (verbalization and task-related information heeded separate and distinct); Verbalization is "subordinate to, and passively dependent on, the ongoing cognitive processes" (verbalization gives task-related information heeded); Verbalization "is primary and must follow requirements of form and content imposed, for example, by instructions" (verbalization gives some task-related information which is of interest to the experimenter)
Concurrent and Retrospective Verbalization -- Concurrent means verbalization as the task is performed (outputs contents of STM); Retrospective means giving a report about cognitive processes performed earlier.
Recoding -- Level 1, direct, verbalization (information is output in the form it was acquired); Level 2, "Immediate recoding into verbal code" (information is immediately recoded into verbal code before being verbalized); Level 3, "Intermediate scanning or filtering processes" (in which information must be checked for relevant content before being verbalized) or "Intermediate inference or generative processes" (requires the subject to heed information which the subject otherwise would not).
Forms of Probing -- Think or Talk Aloud (direct trace of heeded information); concurrent probing (asking questions of subject while performing task, such as 'what hypotheses are you using to distinguish instances of this concept?'); retrospective probing (questioning the subject after each task); interpretive probing (questioning the subject after all experiments)
Particular and General Reports -- Subjects may be asked for general procedures (which may evoke responses independent of the particular performances of the tasks [relying on prior knowledge], or may induce the subject to generalize from particular trials, or other possibilities); Subjects may be asked to recall more specific information; Subjects may be probed for hypothetical states; Subjects may be probed for general states (which may be responded to "without reference to the context of the experiment").
Directed or Specialized Probing -- Often the experimenter is interested in specific kinds of information, and may try to direct probes accordingly. Some problems with this approach is discussed. The obvious alternative is to have the subject verbalize everything (too much data is not a problem here).
The information processing model is described, which attempts to be unbiased and only assume (essentially) STM with limited capacity (contains the information heeded), LTM with longer fixation and access times, and a central processor (CP) which has access to STM. Aspects of the architecture include control of attention (including interrupts), fixation of information (in LTM), and automation (like compiling) of processes. Within the context of this model, concurrent verbalization outputs chunks in STM which denote symbols in verbal mode. Other chunks in STM may be recoded into verbal mode, then output, assuming task-related processes do not take precedence. Retrospective verbalization outputs the information still in STM and may output information stored in LTM, though this may be unreliable as the indices may not have had time to form completely.
The rest of the article examines numerous experiments in the literature which deal with verbal reports. They are divided into discussions of "(a) the effects on the cognitive processes of the instruction to verbalize and of probes, (b) the completeness of verbal reports, and (c) the consistency of verbal reports with other empirical data on behavior."
(a) This is not a very significant issue if the only information to be verbalized is Level 1 or 2.
(b) Incompleteness may have three causes: the information is actually not heeded; not all the information in STM is reported (for example, processes may become automatic); some information previously in STM is not in LTM or is not retrievable from LTM. One issue discussed is that subjects may learn a concept without being able to accurately verbalize the concept learned. This seems to me to be because the subject is learning the concept by updating his or her discrimination net, and such information need not be directly accessible to STM. All STM would have access to would be positive and negative examples, from which "higher level" processes could try to build a concept definition. Similar remarks apply to a lack of awareness by a subject of mediating associations (i.e., a subject learning A->C after learning A->B & B->C without consciously realizing the use of the intermediate association).
(c) Studies claiming inconsistency of verbal reports with behavior are discussed, with the conclusion that the information-processing model would not predict the referenced experiments to give consistent data.
Gordon S. Novak Jr. "Representations of Knowledge in a Program for Solving Physics Problems." Proceedings of the Fifth International Joint Conference on Artificial Intelligence, M.I.T., Cambrdige, MA, August 1977.
This paper describes a program, ISAAC, which solves physics problems by parsing the English problem using an Augmented Transition Network, building a "language-free" interpretation in terms of frames, then abstracting object frames into canonical object frames (e.g., a man becomes a "point mass" or a "pivot point"). These canonical objects are related by the laws of physics and the appropriate equations are set up and solved. This program is more sophisticated than others such as STUDENT (for algebra word problems) and CARPS (for Calculus problems) which took a more direct path from the English representation to the problem representation.
The most interesting idea here is the abstraction from objects to canonical objects. This (if done correctly, and if I understand correctly) gives a deductive abstraction, in which a solution for canonical objects gives a solution for the "actual" objects. The conclusion contains some interesting points about use of canonical objects (which possibly carry over to deductive abstraction) in scientific discovery.
Yulin Qin and Herbert A. Simon, "Imagery and Mental Models in Problem Solving," AAAI Spring Symposium on Reasoning with Diagrammatic Representations (1992).
This short paper concerns an experiment in which six subjects were asked to study "the first part of Einstein's 1905 paper on special relativity" and try to understand the material while providing information in the form of verbal protocols and (in some cases) sketched diagrams. The information was used to infer how subjects used imagery to help understand relationships among the different variables in the thought experiments (such as, "A beam of light travels from A to B . . ."). These images were used to help derive the Lorentz equations (after several sessions) from the constancy of the speed of light. The evidence also seems to indicate that mental models (information in LTM) affect these images (which can lead to misconceptions, especially in a case such as relativity in which the subject must deal with a non-Euclidean conception of space-time). Furthermore, these images can affect the mental models in LTM.
Allen Newell and Paul S. Rosenbloom, "Mechanisms of Skill Acquisition and the Law of Practice." In J. Anderson (Ed.), Cognitive skills and their acquisition. Hillsdale, NJ: Erlbaum, 1981.
The authors examine the power law of learning. This law states that T = BN-a where T is the time to perform a task after N trials. B is the time taken on the first trial. This is linear in log-log space (i.e., take logs of both sides and compare logT to logN).
Several experimental examples from earlier work is given to support such learning curves (by showing the graph is log-log linear). One of the most interesting tasks is that of having 10 lights either on or off and having the subject press corresponding keys. This example is expanded upon in terms of chunking later in the paper.
A section examining the mathematical properties of power laws notes the derivative (thinking of N as a continuous variable) is
dT/dN = (-a/N)T.
For exponential learning we have dT/dN = -aT (a learning rate proportional to time). There is an obvious similarity, except that the learning rate for the power law slows down with the number of trials (which corresponds to a slower approach to the asymptote).
The power law can be generalized (losing the log-log linearity) by including a constant E>0 (added to N) representing prior experience and a constant A>0 (giving a nonzero asymptote). The constant A can also be used to generalize the exponential learning law. It is noted that if we view the number of trials as a function of time instead of the way above, the power law is still a power law (with a different a). The exponential law becomes a hyperbolic law, and the hyperbolic law becomes an exponential law.
A section discusses fitting data to a family of curves. The families of curves considered here are exponential, power laws, and hyperbolics. Hyperbolics are actually special cases of power laws (with a=1). It is argued that the power law fits experimental curves better than exponentials or hyperbolics, but it is unclear if this is genuine or caused by the fact that power laws have 4 degrees of freedom (parameters to be chosen) while exponentials and hyperbolics have only 3 degrees of freedom (parameters to be chosen). In any case, it is argued that the difference between the fits of exponentials and power laws rule out exponentials. The difference between the fits of hyperbolics and power laws is not enough to rule out hyperbolics.
A number of possible mechanisms could account for the power law. Some which have been suggested include the mixtures argument, stochastic selection and exhaustion of exponential learning. The mixtures argument suggests that performance depends upon several mechanisms which can learn (speed up) at an exponential rate. The constant in the exponent, however, varies with the mechanisms (some mechanisms learn faster than others). It is suggested that the power law could be a manifestation of this situation. This is possible, but such a situation could also account for any number of other curves (this is suggested using Fourier transforms for arbitrary [nice] functions). Two examples of stochastic selection provide alternative explanations. Crossman's model involves trials of different methods (according to some probability measure which changes with time). Replacement (of incorrect responses to correct responses) vs. accumulation (of correct responses drowning out incorrect responses) corresponds to exponential vs. power law learning. Finally, the notion of exhaustion (that the power law of learning stems from the fact that some portion of the learning process is being diminished over time) is discussed. This notion leads to a chunking theory of learning which leads to a power law of learning. The ideas are discussed in the framework of the 10 lights/responses experiment of Seibel (1963). One interesting aspect of the discussion is that the more high-level a chunk is, the more specific it is (and the less likely it is to be part of a concrete situation). Of course, the chunks being discussed here really encode entire patterns (with no loss of information).
John R. Anderson, "Acquisition of Cognitive Skill," Psychological Review, 89:4, 369-406 (1982).
This paper describes how the ACT system can acquire skills by acquiring productions. This skill learning is done in three stages. First, the knowledge used to perform the task is stored declaratively and used by general problem solving productions to solve the problem. Second, this declarative knowledge is converted into procedural knowledge by composing and proceduralizing productions. Third, the production system is "tuned" using generalization, discrimination, and strengthening. The theory is used to (approximately) explain the power law of learning.
The author first describes the ACT system using an example task: multicolumn addition. The significant features of the system include the goal structure, conflict resolution (refractoriness, specificity, then strength), and variables (all of which are standard components for production systems--except the goal structure which is built into ACT, where in OPS goals are WME's--or this is my understanding of ACT from the paper).
The author then describes learning in ACT, starting with the Declarative Stage--using the example task of proving simple geometry theorems which he has three high school students perform. The students read the relevant material in the text. From this they (apparently) form declarative knowledge which can be used by general productions to solve problems. The students must rehearse the information in order to keep it activated (or, in STM) long enough for the general productions to use it. Loss of information forces the students to do some recomputation. Some of the students' mistakes and misconceptions are noted. Also, a justification for the idea that declarative knowledge precedes procedural knowledge is given. It is argued that declarative knowledge is less "potentially dangerous" because if it is found to be wrong (i.e., in conflict with other knowledge), then it can be deleted. This is not necessarily true of procedural knowledge. (This could explain why people are so damn locked into their paradigms!)
The author then discusses the second stage, Knowledge Compilation. In ACT, knowledge compilation is accomplished by composition and proceduralization. Composition involves creating a new production by composing n productions which fired in sequence (and are related by a goal). Proceduralization involves instantiating certain variables in a production by constants and eliminating unnecessary checks. Composition creates productions which make more demands on STM. Proceduralization creates productions which make fewer demands on STM.
The author then discusses the final stage, Procedural Learning, or Tuning. After task-specific procedures to perform a task have been compiled, ACT continues to learn by improving its methods for solving the task. This is accomplished by generalization, discrimination, and strengthening. Generalization involves building a new production from one or more by replacing constants with variables and dropping conditions. Obviously this creates a more general production that those from which it was formed. Discrimination involves taking a production which is known to be incorrect (either by external or internal criteria) and making a new production which might be correct. The new production can either be a condition discrimination, which involves restricting the old conditions further, or an action discrimination, which changes the action of the old production (possibly restricting the conditions further as well, using the conditions of the "incorrect firing" as a guide). He gives examples involving language acquisition. Finally, strengthening involves associating a "strength" value with productions (which is used in conflict resolution). Productions are strengthened (by adding .025) when they (or a production from which they were generalized) are fired. Productions are weakened (by multiplying by .25) when they apply but receive negative feedback. The idea is that generalization and discrimination lead to many productions, and strengthening can be used to choose the ones which will actually be used in the long run.
Finally, some equations are derived which support the idea that ACT's learning scheme approximates a power law.
The real importance of this paper seems to be that it contains specific methods for proceduralizing declarative knowledge. It is unfortunate (in my opinion) that it does so in such an inductive manner (in the spirit as most "heuristic-based" AI).
Pat Langley, Jan Zytkow, Herbert A. Simon, Gary L. Bradshaw, "Mechanisms for Qualititative and Quantitative Discovery."
This paper describes four systems which simulate scientific discovery: BACON, GLAUBER, STAHL, and DALTON. BACON mainly finds patterns in quantitative data, inducing equations which give relationships between variables. It is also able to postulate "intrinsic properties" associated with symbolic data. The authors note the difference between data-driven and theory-driven discovery techniques. Data-driven techniques are bottom-up, while theory-driven techniques are top-down. BACON is essentially data-driven.
GLAUBER discovers qualitative information. In particular, facts are presented to the system in terms of "frames"--i.e., "(predicate object property-list)". The system creates classes of particular elements which show up as common properties in different facts. (Hence, the system abstracts.) It is also capable of integrating two classes which have "enough" members in common. (Hence, the system generalizes.) This system seems the most relevant to my interests. The example he uses to demonstrate GLAUBER is creation of the concepts "acids", "alkalis" and "salts" using information about taste and chemical reactions. They note that the frame representation has an ambiguity with respect to the quantification of variables, once class variables have been introduced (through abstraction). They also note the need for heuristics to order the classes by importance.
The system STAHL is more specific to chemistry (but could be used in other domains as well). STAHL infers the components of substances by examining chemical reactions and applying heuristics to simplify the reactions. They demonstrate how STAHL might infer the phlogiston theory.
Finally, the system DALTON is used to construct molecular models. The program searches through a space of models which attempt to model a chemical reaction by inducing the structure of the components of the reaction (i.e., the molecular ratios as well as the molecular structures). DALTON uses a conservation assumption to restrict its search. In this sense, it is more theory-driven than the other systems.
Finally, the authors discuss possible interactions between the different systems which could be used to build an integrated discovery system. The conclusion relates these systems with other AI work, including Lenat's AM.
Tulin Qin and Herbert A. Simon, "Laboratory Replication of Scientific Discovery Processes." Cognitive Science 14, 281-312 (1990).
This paper discusses experiments in which subjects were given data which correspond to the distances (from the sun) and periods of the planets Mercury, Venus, Earth, Mars and Jupiter. The data was not labeled as such, but only labeled as five values of "s" and "q". The subjects were asked to induce a relationship between the values. The correct response was Kepler's Third law, that the square of the period is equal to the cube of the distance.
The program BACON can induce this formula using heuristics which allow it to recognize constants, linear functions, and look at quotients which are both increasing and products of variables when one is increasing and the other is decreasing. These heuristics lead BACON to examine P/D, P/D^2, then P^2/D^3.
The behavior of the subjects is discussed. In the first experiment, subjects were given calculators which included exponents and logs. Two subjects solved the problem, one got close, and the other six failed to solve the problem. Three subjects never considered nonlinear relationships. Linear relationships were considered most frequently. While there were individual differences in the functions considered, a common heuristic is to consider simple functions over more complex ones. Most subjects used scatter diagrams.
The behavior of a successful subject, SY, can be interpreted as search in a rule space and an instance space (as described in Simon and Lea (1974), see Models of Thought I:5.5). The rule space is partitioned into two levels: a function space and a parameter space. A production system describing SY's search is described (and a problem behavior graph is included in the appendix). A very important aspect of SY's behavior is the use of feedback from hypothesis tests to choose a new hypothesis. Many subjects who failed to solve the problem simply generated a new hypothesis whenever one failed (without examining the nature of the failure). Also, the other successful subject, S3, is discussed along with a discussion of the unsuccessful subjects.
In the second experiment, subjects were given calculators which did not contain exponents or logarithms (which effectively prevented subjects from searching for logarithmic relationships--as some subjects did in experiment 1). Two subjects succeeded while the other three failed. The results correspond closely with experiment 1.
The heuristics actually (apparently) used by subjects in the experiments are listed (even those which are faulty). Finally, the nature of data-driven discovery is discussed and placed in a historical context.
I asked Simon about induction vs. deduction. In particular, I brought up the fact in his paper (Chapter 5.5 with Glenn Lea, "Problem Solving and Rule Induction (1974)" in Models of Thought I) that both problem solving and rule induction are inductive processes (rejecting the "deductive"/"inductive" dichotomy suggested by Hunt). I then asked, "Do you believe there are any processes or techniques that humans use which are deductive?"
(He begins by writing "Induction" and "Deduction" on the board. He notes that these are the two basic methods of reasoning, "and we all have a basic idea of what they mean.")
Induction: Induction is a method of going from particulars to the general. For example, if every raven you see is black, you may want to conclude, "All ravens are black." But this is, of course, uncertain. This uncertainty is inherent in the inductive process. You've only ever seen a finite amount of data--a finite number of ravens. Perhaps someday you'll see a raven that's white, or even blue. (My aside: This, of course, reminds me of my Spume article about "ducks can be blue.") In the example of the Thurstone Letter Series Completion Test ("if we can stand to see it again"--he writes "ABMCDM_" on the board), we might induce a pattern and try to extrapolate the sequence by writing "EFM". But there's no guarantee that's what the experimenter had in mind. "There are some malicious people in the world, and he may not . . . he or she may not . . . (well, mostly men are malicious), he may have had some completely different pattern in mind." For example, in music it would be reasonable for the pattern to be "ABMCDMABMCDM..." "That's the music that comes thumping out of those cars."
This problem of when you can generalize has long concerned philosophers, and the definitive answer was given by Hume--you can never be certain. The problem of induction concerned Simon's advisor Carnap for the last twenty years of his life, in which he was trying to determine when an inductive generalization could be made with certainty. "I always thought that was too bad because he could've been doing something much more important the last twenty years of his life."
Many people adopt a probabilistic view of inductive certainty. They hypothesize a pattern which is supposed to exist in reality and ask themselves what is the probability that the universe is actually random. So, they set up their hypothesis against the ("I don't want to say strawman") hypothesis that there is no order in the universe. Then they get certainty that their hypothesis is true with p=0.000... (laughter) (Chad's comment: Newton's law of gravitation would look "certain" by this standard, since gravitation appears to have some pattern which is at least close to Newton's law and far from being random. And we know Newton's law is wrong by the falsification provided by the orbit of Mercury.)
So, if induction is uncertain, why use it? Well, for Simon, it stems from a belief that there are patterns in the universe, that there are "natural kinds." Why believe that? Simon describes the situation as an "all win/no loss" scenario. If there are patterns in the universe, then he actually is going about finding those patterns. "If there are no patterns, well, then it doesn't matter, does it?" So, he looks for these patterns. "And, of course, sometimes we fall flat on our face" (he says, waving his arms to his bruised face. Chad's note: Simon apparently fell last week between Tuesday and Thursday. Last Thursday his eyes were especially black and blue and he had a bandage covering part of his forehead. His eyes still look quite bruised.)
Deduction: In deduction, you start with some rules and some sentences. "Maybe you just start with a sentence, like 'Socrates is a man.' Well, I guess you need some rules too, like 'all men are mortal.'" According to deductive logic, you can conclude with certainty that Socrates is mortal. He drew the deduction on the board as:
Socrates is a man
All men are mortal
______________
Socrates is mortal
Why are we so certain of this knowledge? Now, we have to be careful, the certainty is not that Socrates is mortal, but that if Socrates is a man and if all men are mortal, then Socrates must be mortal. Why do we feel so certain of this?
Student 1: "Because we can draw Venn diagrams, with man a circle inside mortal and Socrates inside man." (Simon draws this diagram on the board and explains it.) Simon explains that if we find such diagrammatic diagrams convincing, then this works, but the question of why we find it convincing is still on the floor.
I suggested that it is the nature of language and our notions of truth, falsehood and their relationship. He asked me to elaborate. I gave the example of "and" meaning logical conjunction. He asked more specifically about the Socrates example. I noted that the example used both predicates and a universal quantifier, which takes us into the predicate calculus. So, I explain, truth is a collection of sentences which can be produced by certain productions. These productions involve certain logical constructs which define the relationship between truth and falsehood. "So, is it just convention, then?" he asked me. "Do you find that convincing?" I responded, "I don't want to disagree with that, but I do find it very convincing." (I disliked his use of the word "convention," but could not think of a more appropriate word.) "You're free to disagree, you know," he said, then immediately went to another student with their hand up.
Student 1: "The reason I find this convincing is because Socrates is dead." (Laughter) Simon explains that isn't really the point. The point is that because Socrates was a man and because we believe all men are mortal, we feel forced to believe that Socrates was mortal, and that we'd be "lamenting the death of poor old Socrates even if they hadn't poisoned him." "We could change the problem. Let's see, I won't make it too personal by saying 'Herb Simon is mortal', but how about 'Bill Clinton is mortal.' Do we believe that?" (The students agreed that we do.) "But he isn't dead yet?"
Student 2: "It's easy. We find this convincing because of basic logic. If Socrates is a man, and all men are mortal, then of course Socrates is a man." Simon's response: "Well, not every relation is transitive. My father is my father, and his father is his father, but his father is not my father."
Simon pauses, and as I start to say "So . . ." Simon begins to speak, then halts and allows me to continue. I ask, "So is the question now: 'Why do we find logic convincing?'" "Yes," Simon responds.
Simon elaborates on the history of relying on the certainty of logic. At the turn of the century, Frege and Whitehead (later with Russell) were attempting to reduce mathematics to logic. Why? Because logic seems so certain. Then Russell discovered a paradox in their system. Now, he later overcame this problem, but in a highly implausable (according to Simon) manner using his theory of types. Later there were better methods for getting around this problem in a more natural way. Then, Russell and Whitehead wrote Principia Mathematica, three large volumes of formal symbols attempting to reduce mathematics to logic. "Have any of you ever looked at Principia Mathematica?" A few of us raised our hands. "It's not pleasant reading, is it? I mean, from a literary standpoint." So, why did they do it? Because we find logic so convincing. But why? Why do we find logic so convincing?
One answer that some people have given is that logic doesn't tell you anything new. It doesn't rule out any possible worlds. It doesn't tell you anything empirical about the world.
Simon's answer, from a psychological point of view, is that we find logic convincing because we think in production systems. This is especially clear in the example of Socrates. We can have a Working Memory Element (man Socrates) and a production (man X) ==> (mortal X). (He put these on the board.) "Now, what's the first thing that's going to happen when this system 'comes to life'?" Someone tells him it will fire the production and put (mortal Socrates) in working memory. "That's right! It's inevitable. You can program it tonight." (laughter)
So, Simon has concluded that the reason we find deductive logic so convincing is because the firing of productions corresponds to the application of deduction rules. He has coauthored a recent paper with Eisenstadt (see below) making this argument in response to Rips' book which claimed we think in a logical representation.
But, back to the original question: Most of what humans do is inductive reasoning. There was some confusion about this because his first AI project was the Logic Theorist which proved deductive theorems of logic. This led some people to conclude that the computer was performing deduction. Actually, the program was finding proofs of deductive theorems, and finding proofs is an inductive process. "We first planned to write a chess program, then we backed off from that and planned to do a geometry proving program--which still would have been 'deductive'--then we finally settled on doing the Logic Theorist." Even chess has rules that must be followed, like the rules of deduction, but deciding what do with those rules is an inductive process. The rules tell you what you can do, not what to do. "It's a lot like human laws. Human laws have a lot of 'Thou shalt not's', but not a lot of information on what 'thou' ought to do."
PS: There was also the mention of a Mathematician who had worked on foundations until Goedel's incompleteness theorem was proven. Simon had lunch with him twenty years later and the Mathematician said that he had never worked on foundations again. Simon said he could see the sadness in the Mathematician's eyes.
Stuart A. Eisenstadt and Herbert A. Simon, "Logic and Thought." Minds and Machines 7:365-385, 1997.
This paper is a response to Lance Rips' Deduction-System Hypothesis put forward in his book The Psychology of Proof. The Deduction-System Hypothesis suggests that certain deduction principles are embodied in the human cognitive architecture. Simon and Eisenstadt respond with a converse thesis, the Production-System Hypothesis. The converse thesis suggests that the human cognitive architecture is a production system which builds in certain deductive principles.
Production systems and the (EPAM) index for memory are discussed. A production is analogous to an implication statement, and a working memory element is analogous to atomic statements. Thus, modus ponens is built directly into the system. A small discussion involves the use of "types" to represent conditional beliefs. (It appears that this "conditional" reasoning would need to be used with a potential infinite number of levels to do even classical reasoning).
A section describes how logic can be encoded in a production system. The authors describe this as a "Gentzenization" of logic. Reasoning by contradiction can be used to delete conditional beliefs and learn the negation of a root of the contradiction. This can also be done when an unwanted conclusion is reached (the conclusion need not be a strict contradiction). A Boolean "cut" rule corresponds to learning by chunking productions. In further sections the authors consider extending the logic to predicate calculus, including a discussion of the role of negation (a kind of closed-world-assumption is preferred).
The sources of errors and limits are discussed briefly. Deduction is compared to nondeterministic machines. The rules determine what can be proved, not what will be.
"Historical Addendum"
The prehistory of Cognitive Science (or, "information processing psychology") is considered in two steps: pre-WWII and post-WWII, and along two threads: psychology and formal logic.
Pre-WWII, psychologists were primarily behaviorists, except the Gestalt school of Europe. He notes a few exceptions consisting of psychologist interested in representations "inside the head." They also introduce the notion of the symbol level being a level of abstraction of the neurological level analogous to chemistry and physics, as opposed to reductionism which seeks to explain the mind in neurological terms. Virtually all American psychologists who were not behaviorists accepted reductionism.
Formal logic, developed early in the century, provided a prototypical example that thought processes (in particular, those involved in deducing mathematical truths) could, in principle, be represented as a manipulation of symbols. Ultimately, the computer (in abstract and in actuality) developed from these notions.
Post-WWII, developments include Wiener's "cybernetics" which combined information theory, feedback systems, and the computer. In other countries, cybernetics included game theory, mathematical economics, and management science and operations research. Some forerunners to cybernetics include Shannon's master's thesis (1938) using Boolean algebra to analyze switching circuits, the Pitts-McCulloch paper (1943) providing a Boolean analysis of nerve networks; and the Rosenblueth-Wiener-Bigelow paper (1943) on "Behavior, Purpose, and Teleology." All of these had roots in symbolic logic.
Also, in psychology during the fifties, work on organization, communication, and memory (including Miller's "Magic Number Seven" paper--1956) appeared. In linguistics, Chomsky's work related language to information processing. In pre-computer science, the first languages (including FORTRAN) were developed.
They consider 1954-1958 to be the critical years. Newell decided to build a chess playing program in order to test the computer's effectiveness as a symbolic, non-numerical device. This was to be an "AI" program using heuristics. By 1955, Shaw and Simon had joined Newell. In late 1955, the group decided a propositional theorem prover would be a more reasonable project and by December a flow diagram describe the Logic Theorist had been created. In 1956, this program was realized in the IPL-I (Information Processing Language I) programming language. It was implemented on JOHNNIAC, a computer at the RAND Corporation, and the first complete machine proof of a theorem of Principia Mathematica (Theorem 2.01) was produced on August 9, 1956.
During the summer of 1956, a research project seminar on AI took place at Dartmouth. It was organized by McCarthy, Minsky, Rochester, and Shannon. Attendants included those listed above as well as Gelernter, Newell and Simon. The LT program led to interest in a potential geometry theorem prover.
LT involved ideas from psychology and human behavior, which led to a first sketch of EPAM (Elementary Perceiving and Memorizing Program) in 1956. The performance of LT was compared with human performance and it became clear that the two were using different problem solving methods. Analysis of human problem solving led to a flow diagram of GPS (General Problem Solver) in 1957.
By the spring of 1958 research in AI by the RAND-Carnegie and MIT groups was well under way. Communication between these efforts and information processing psychology was, however, still weak. A summer seminar was organized at the RAND Corporation by Newell and Simon in order to help acquaint social scientists with computer simulation techniques. This seminar was effective in relating the Carnegie-RAND work with mainstream information processing psychology.
"Chapter 2: Information Processing Systems"
This chapter describes the organization and functions of an IPS (Information Processing System) and introduces notations for describing the structure and behavior of IPS's.
An IPS is defined as consisting of:
1. a set of elements called "symbols"
2. a set of "relations" used to connect "symbol structures" consisting of a set of "tokens" (equivalently, "instances" or "occurrences") of symbols
3. a "memory" for storing symbol structures
4. "information processes" which operates on symbol structures
5. a "processor" consisting of a fixed set of "elementary information processes" (eip's); a "short-term memory" (STM) for the input & outputs of eip's; and an "interpreter" that determines the sequence of eip's to be executed as a function of the symbol structures in STM
A symbol structure "designates" ("references"/"points to") an object if there exist information processes that admit the symbol structure as input or output and either affect the object or produce, as output, symbol structures that depend on the object.
A symbol structure is a "program" if the object it designates is an information process and the interpreter, if given the program, can execute the designated process.
A symbol is "primitive" if its designation is fixed by the elementary information processes or by the external environment of the IPS.
An "object" is either a symbol structure in memory, a process IPS is capable of executing, or an external environment of sensible stimuli.
Symbol structures may be represented in different ways, including as lists and as property lists ("attribute-value associations" forming an "association structure"). Examples are given.
Elementary information processes exist which are sufficient, with available composition techniques, to span all information processes (though this is not proven). These eip's include discrimination, tests and comparisons, symbol creation, writing symbol structures, reading and writing externally, designating symbol structures [and referencing designations], and storing symbol structures.
The notion of programs and interpretations of programs is discussed, and an example of a "thermostat" program in a simple programming language is given. Also, the boundary between the parts of a system which are (internal) mechanisms and those which are interpreted programs is discussed, especially in the area of recognition. Recognition may be performed by an interpreted program implementing a discrimination net such as EPAM.
A small amount of discussion is given to interpreting the "deep structure" or "meanings" of phrase-structured strings (from linguistics, namely Chomsky's theories) as symbol structures in an IPS.
Notation for describing Information Processing Systems in natural language (as opposed to a formal language such as IPL-V) is given, and the top level of the LT program is described in this way. A program may be described by sequential structures or by production systems (which leads to a discussion of BNF and a generalization of BNF).
"Chapter 3: Task Environments"
This chapter discusses adaptive or rational behavior of agents placed in an environment, and when that behavior gives information about the task environment and when it gives information about the agent. A theory of problem solving must include both a study of task environments (which induce behaviors) and a study of the internal mechanisms affecting (which may limit perfectly rational behavior) the agent's actions. These are "(1) the demands of the task environment and (2) the psychology of the subject."
The external environment must have an internal representation. He gives an example of a "Necker cube" (two-dimensional representation of a cube). The subject may internally represent it as a three-dimensional cube in two different ways, while the experimenter knows it is "really" two-dimensional. A distinction must be maintained between the stimulus and the subject's interpretation(s).
Examples of internal representations are given, including number scrabble and its equivalence to tic-tac-toe (although this is by no means immediate). It seems clear these games, although equivalent, have different (natural) internal representations. The natural representation of tic-tac-toe prevents the subject from considering illegal moves, while the natural representation of number scrabble does lead the subject to consider such moves. What is an "objective" representation the experimenter should use? "One possibility is to omit a description of the objective problem space, but to incorporate in the theory hypotheses about the internal representation the subject will himself use for the problem space." "A somewhat different possibility . . . is to construct a hypothetical problem space that is objective only in the sense that all of the representations that human subjects in fact use for handling the problem can be imbedded as specializations in this larger space." Of course, I'd like to see a category of representations and mappings between them which preserve the "important" properties of the representation.
Internal representations of task environments can be interpreted as linguistic deep structure. The authors adopt the viewpoint that the strings of natural language--their surface structure--is "inessential to human problem solving of the kinds [they] examine," and that "internal symbol structures that represent problems and information about problems are synonymous with the linguist's deep structure."
A detailed representation of number scrabble is given as an algorithm and data structures.
The general nature of a "problem" as a task environment with a goal and operators (actions). Two representations of problem spaces are discussed: Set Representations (generate and test) and Search Representations (apply operators--state space and action space).
Some discussion is given to the need to specify the intelligence (and limitations) of a problem solver within a task environment. For example, if a problem solver is to cross a deep chasm, then the problem solver may know how to cross a bridge but not how to invent an airplane. This includes a discussion of "task invariants."
"Chapter 4: Problem Solving"
The beginning of the chapter gives an overview of problem solving. There is a diagram giving the general organization of a problem solver (which includes choosing an internal representation and choosing and applying methods--the latter are concentrated upon). Difficulty and the recognition method are discussed. The two representations given earlier are associated with methods. The set representation is associated with the generate-and-test method (an extreme example of which is to prove a theorem by generating all strings in some order and test each one to see if it is a proof of the theorem; this can be slightly improved upon by generating valid proofs and testing each to see if it is a proof of the theorem ["British Museum Algorithm"]). This method allows no interaction between generation and testing. The search representation is associated with the heuristic search method. Each time an operator generates a new element, that new element is evaluated to determine if it should be searched upon (made current), the current element should be searched upon further, or some other untried-problem should be searched upon (made current).
The Logic Theorist (LT) is described at some length as an example of a heuristic search problem. LT performs a backwards search from the propositional formula to be proven using the operators: Detachment (DT), Forward Chaining (CF), and Backward Chaining (CB). Each of these operators requires a parameter from the list of axioms and known theorems. For example, (ignoring matching) DT[A=>B](B)=A. These operators build lists of formulas which logically follow in sequence. To successfully complete the goal, an operator must reduce the formula, via the operators, to a known axiom or theorem. Matching via the Substitution Test (SB) is used extensively to decide if a formula can be made to equal another (via substitution of variables or replacement of connectives [e.g., A=>B by -A \/ B]). This is used both to test for recognition of a subproblem as an instance of an axiom or known theorem, and also to apply operators.
The LT search system is not complete. In particular, it cannot prove P \/ ---P. This theorem cannot be proved because of the unidirectional application of detachment.
LT demonstrates a common use of selectivity. Namely, if we search for an object satisfying properties p1, p2, and p3, then we should generate only those objects which already satisfy two of these (the "optimal" two) and test for the third.
The complexity of LT is roughly compared with BMA (British Museum Algorithm), giving an indication of why LT is better.
The matching process is described in more detail.
LT naturally performs a breadth-first search. Heuristics to prune out portions of the search tree are briefly described ("stop rules") and the effects on the search are given. Also, the effect of the list of known theorems on search is discussed. Also, possible heuristics to scan the untried-problem list to search on the most "valuable" formula is discussed briefly.
Another heuristic uses abstraction by ruling out matching "dissimilar" formulas (the "similarity test")--where similarity is determined by describing the formula in terms of number of & number of occurrences of variables on the LHS and RHS of a formula. This heuristic does not seem to help a great deal since it usually rules out matchings that would have been ruled out quickly by the matching procedure anyway.
Processing effort in terms of eip's are given for the nontrivial theorems proven from Chapter 2 of Principia Mathematica. This provides a way to evaluate heuristics such as the similarity test. It also leads to the conclusion that processing effort is, generally, exponentially related to proof length.
Two possible ways LT could learn from experience are discussed: (1) The known theorem list could be expanded when new theorems are proven. (2) A list of known theorems could be associated with each of the DT, CF, CB operators (those which have proven useful with that operator on previous problems) and these associated theorems checked first (falling back on the other theorems as a backup). This second option was implemented and performance compared on the nontrivial theorems proven from Chapter 2 of Principia Mathematica. About half the problems were solved twice as fast, with the other half solve twice as slowly. This is because performance improved on theorems with proofs similar to proofs seen before and performance was hampered on theorems which required novel proofs.
"Chapter 5: Cryptarithmetic: Task Analysis"
The task of cryptarithmetic is introduced, via the problem DONALD+GERALD=ROBERT, where each letter represents a distinct number between 0 and 9 making the corresponding sum true (and we are given D=5). Such a problem can be attacked in several different problem spaces, some of which are given here.
The basic problem space consists of a depth first search of possible assignments, pruning out paths when the partial assignment is inconsistent with the sum (or some letter or digit is used twice in the assignment). The only knowledge at each state is a partial assignment, and the only operators are assignment of a letter to a digit. The search tree involves 252 branches and finds the solution. This search can be modified to infer the value of letters which are the only unassigned letters on a column, resulting in a search tree of 23 nodes.
An augmented problem space is described. The states here also contain information about carries (as well as letters, taken together to be "variables"), and more complex relationships between the variables and digits than just equality (including various inequalities, parity, and a list of alternative possible values).
Finally, an algebraic problem space is considered, in which each of the letters and carries are variables, and each column forms an equation (so we have six equations). The two basic operators are "substitute" (for when we know a variable's value) and "infer-consequences." Assuming a powerful enough inference engine (with a sufficiently powerful language to express an "inference"), the problem can be solved in this space with no search!
The author also gives examples of problem spaces which do not fully represent the problem (so that solutions are only "plans"--or conditions on possible solutions). These are (inductive) "abstraction" spaces, and can be useful for narrowing the search space in the original problem space. A particular abstraction space considered only retains information about a letter's parity and whether it is high (> 4) or low (< 5). The results of this search narrows the possible solutions from 9! to 1680. The search tree (in the subspace of the basic problem space which must be searched using the constraints from the abstracted space) had 17 branches (a reduction of the 23 and 87 branches in the basic problem spaces considered).
"Chapter 7: Cryptarithmetic: A Broader View"
In Chapter 6 the notion of a PBG (Production Behavior Graph) is introduced, essentially as a trace of a subject's traversal through a problem space (induced from data such as verbal protocols). This notion was used to analyze the verbal protocol for a subject S3 who solved the DONALD+GERALD=ROBERT cryptarithmetic problem. This subject worked in the augmented problem space. The PBG is used to induce production rules which simulates S3's behavior on the problem.
In this chapter (Chapter 7), the protocols of three other subjects (S4, S5 and S8) who worked in the augmented problem space and a subject (S6) who worked in a basic problem space are studied. Also, much of the protocol for S8 for the LETS+WAVE=LATER problem involves a search for a good problem space in which to search. This information is analyzed to suggest the different problem spaces which are considered for a cryptarithmetic problem.
The protocols for the DONALD+GERALD=ROBERT problem are abstracted into episodes. Each episode is a sequence of moves which give "obvious" information. The notion of "obvious" is made somewhat precise. Thinking of the problem space as a category, some processes (arrows) out of a state (object) lead to "obvious" information, and these are expected to be followed (presumably there is an assumption here that the processes can be carried out in any particular order...?). This leads, actually, to a more abstract problem space--a deductive abstraction in fact! The objects are the "heads" (i.e., objects from which there are no obvious arrows out) and the arrows are the arrows between heads. This idea deserves more thought . . . (Also, consider conflict resolution in a production system. Perhaps a "head" object is a state in which there is a nontrivial conflict.)
Episode abstracts of the protocols for S3, S4, S5 and S8 are given. These are compared and contrasted. The conclusion is that although the paths are divergent, the fundamental programs the subjects used to traverse the augmented problem space appear to be very similar. The divergences are mainly caused by different priorities for strategies and errors being made at different points in the process.
An episode abstract of a production system is given and compared to S4, then the production system is slightly modified to compare to S3, then S8. The idea is that essentially the same program (with only slight variations) models them all.
Next, the theory is generalized by considering the cryptarithmetic task CROSS+ROADS=DANGER. It is somewhat more difficult than the DONALD+GERALD=ROBERT task. Episode abstracts for S4 and S2 are given. Both of these subjects worked in the augmented problem space.
Subject 2 performed the task while wearing a helmet to monitor his eye-movement. The eye-movement data was aggregated into attention units, scan units, transition units, and excursion units, and used, along with the verbal protocols to induce a production rule system to match S2's behavior during the task. The closeness of the match is discussed.
"Chapter 14: The Theory of Human Problem Solving"
This chapter provides a review of the book, summarizing the results and conclusions concerning human problem solving as an IPS (Information Processing System). The key concepts are those of the task environment, problem space, and program. The invariants of human IPS's (such as limitations on STM, the structure of LTM, elementary processes, etc.) are discussed. An argument is given to support the production-like and goal-like characters of programs used by human problem solvers.
A section is dedicated to the notion of a problem space. It is defined in terms of (immediate and extended) knowledge states, operators, and initial and goal states. Invariant features of (human) problem spaces include finite generation (with closure of knowledge), small finitely generated set of operators, limited backup storage (of about two nodes), short residence time at each knowledge state (on the order of seconds), incremental search with occasional backup, and moderate knowledge state size. An interesting discussion of models ("dynamic" information) vs. descriptions ("static" information) is included in this section.
Next a section on task environments discusses how information from the task environment is embedded in problem spaces in order to select operators, evaluate knowledge states, deciding when to save knowledge states for backup, and when to back up. There is some formalism developed with respect to choosing operators and evaluating knowledge states. Also, the notion of a factorable problem space is defined. The notion of using variables as parameters to operators (and states) and instantiating the variables at some later point in search is briefly introduced.
Next, a section on methods and programs discusses how a problem space might be traversed. An attempt is made to restrict the methods from the space of all programs by considering many methods functionally equivalent and by considering mixtures of methods.
Finally, a sketchy section is included on how individuals determine problem spaces and programs. Though many invariants in the task environment and in the human IPS restrict the problem spaces and programs considered, there are also task differences and individual differences (which depend upon the contents of LTM) which make absolute conclusions difficult. Still, the authors give ideas of what information could be used to construct a problem space and program. This information includes task instructions, previous experience with the same or analogous tasks, general programs in LTM (such as GPS), programs in LTM for constructing spaces and programs, and methods for modifying or augmenting the problem space or program (or both) during problem solving.
2.3. Simon, H. A. "The Information Storage System Called 'Human Memory'"
This section describes current (in 1979) theories about human memory and some of the key experiments which led to certain conclusions. It also describes the relationship between information processing psychology to physiological psychology by discussing the abstraction from the neural level to the symbol level (using the analogy with physics and chemistry). The argument given is not against reductionism in principle, but only in practice. Also discussed is the scientific methodology for inferring the existence of abstract entities--as molecules were postulated as part of a theory to explain observable phenomena (so, this type of inference is inductive!). He argues that the structures postulated by information processing psychologist should give physiological psychologists phenomena to describe in terms of the neural level.
Distinct "memories" (such as LTM vs. STM) can be characterized by "(1) the kinds of inputs that can be stored in it, (2) the time required to store new information in it, (3) the time required to access information previously stored in it, (4) the durations over which information, once stored, is retained, (5) conditions that cause loss of information over a period during which it could otherwise be retained, (6) the qualitative nature of the deterioration of stored information, (7) the nature of the cues needed for accessing stored information, and (8) the form of organization of the stored information." Distinct memories need not be located in different areas of the brain.
The existence of distinct memories has been inferred from experiments from the following paradigms:
(1) Immediate recall (of an array of letters)
(2) Sperling paradigm (an array of letters is presented, then a tone is presented which
indicates which row [beginning, middle, end] of the array is to be recalled; results vary with the time interval between the array presentation and the tone)(3) Delayed recall (like immediate recall, but with an intermediate task)
(4) Rote verbal learning (longer exposure time or a number of successive trials and a response delay with a possible intervening activity; performance varies with total exposure time, response delay, and the nature of the intervening activity)
The experimental results suggest that auditory memory (STM) is being used in (1) [giving confusion of acoustically similar words], while visual memory (STM) is being used in (2). In particular, we have distinct memories. The results of (1) and (3) contrasted with (4) inductively implies the existence of another distinct memory (LTM).
Experiments have been done to determine the parameters of these memories. STM has the capacity to hold a few (about four?) chunks at a time, where a chunk is anything that can be recognized by LTM (ie, it is familiar). STM also has a short fixation and retrieval time. (He also notes here experimental results which imply a reader visually recognizes a word, retrieves the associated aural word, which is then used to retrieve the meaning.)
LTM seems to be virtually unlimited, and can link two chunks into a new chunk on the order of five to ten seconds. There are two processes involved in adding a new chunk to LTM: familiarization with the new information in semantic memory and (possibly more importantly time-wise?) elaboration of the recognition memory so it can be accessed. Also, chunks are stored in structured forms, so that, for example, if a chunk represents a list, the beginning of the list takes longer to access (about 2 seconds) than the succeeding members (about 300 msecs each).
LTM may be partitioned into two components: recognition memory (index) and semantic memory (encyclopedia). Information processing models of recognition memory include EPAM, SAL, and MAPP. Models of LTM concerned with semantic memory include the systems of Linsday, Reitman-Grove-Shoup, Quillian, Rumelhart-Linday-Norman, Shank, and Winograd. The Anderson and Bower HAM model contains both components. These theories postulate that recognition memory is in the form of a discrimination net and semantic memory is in the form of node-links or property-lists.
Experiments (for example, having subjects associate a new list of words with a highly memorized list of words) have shown that updating recognition memory takes longer than updating semantic memory. Forgetting is generally thought to be a loss of the index pointing to semantic memory rather than the information in semantic memory itself (for example, if the pointer at the terminal node of a discrimination net is changed).
There is a discussion of the modality of stimuli and their internal representation, particularly with respect to visual imagery. Experiments dealing with the "mind's eye" and their implications are discussed. The modality of memory is characterized by context (encoding of features extracted from a particular sense's input--"the sorts of stimulus features they record") and topology (how the corresponding structure is organized in memory--"the sorts of relational structures they employ"). One surprising result is an experiment by Brooks (1956) which varied the sensory modality of stimulus and response and found that response is faster when the sensory modality of the response is different from that of the stimulus. Of course, there must be the ability to translate between modalities (almost certainly with loss of information).
Some have postulated the existence of an "intermediate memory" between STM and LTM. For example, some experiments indicate that while recall can be diminished by an intermediate task, recall can recover if the subjects are given enough time to "get back into context." The loss of an index pointer rather than the information itself can provide an explanation for this phenomena which does not involve a new form of memory. They hypothesize the following sequence of events:
"1. A stimulus is presented, and its familiar components are discriminated by the recognition memory.
2. The internal symbols representing these components are stored in STM.
3. A new node-link structure relating these symbols is stored in semantic memory.
4. The recognition memory is modified to incorporate a new terminal node denoting or 'pointing to' the new node-link structure in semantic memory."
The first three steps seem to occur in a second or less, while the fourth step requires about ten seconds per chunk. It is possible, then, that when the intervening activity occurs, enough information is in semantic memory (due to step 3) to carry out step 4. If the interfering stimulus is similar to the original stimulus, the recognition mechanism may be interfered with, causing the loss of the index pointer. If the interfering stimulus is dissimilar, the distinct regions of the recognition memory may be modified independently.
STM has been shown to hold a small number of chunks and usually encodes its contents in terms of auditory features. All other points about STM are controversial. It is suspected STM functions as a working memory (like registers) for elementary information processes, which would imply the serial nature of human attention. An interesting open question is whether STM also holds control information as well as the inputs and outputs of elementary information processes.
Hierarchic control system such as those used by most programming languages (which require stacks) are contrasted with control by production systems (which does not require stacks). The conditions of productions would presumably depend upon the contents of STM. The actions of productions could place encoded sensory perception information into STM ("perception"), produce a symbol designating a goal or subgoal and place it in STM ("goal creation"), retrieve symbols from LTM into STM ("recall"), and rearrange symbols in STM, making a certain one primary ("rehearsal"). Rehearsal is probably necessary to keep an item in STM.
4.5 "The Functional Equivalence of Problem Solving Skills" (1975)
This section discusses different strategies for solving the Tower of Hanoi problem. These include goal-recursion, perceptual, sophisticated perceptual, and move-pattern strategies. Each strategy is encoded as a production system and the demands such a system would make on the STM and perceptual test abilities of a subject are made explicit. Then it is noted that each of these production systems is an instance of GPS (General Problem Solver), which finds a difference between the current and goal states, finds an operator to remove that difference, applies the operator if its preconditions are met. If its preconditions are not met, then the main difference between the current state and the preconditions is found, an operator is found to remove that difference, and so on.
Some discussion is given to the concepts which the subject would have to have (or induce) in order to implement the strategies. These include the program organization (GPS), the "pyramid" concept if goal-recursion were to be used, perceptual predicates (which are defined via Boolean connectives in terms of concepts given in the task definition), and move patterns if the move-pattern strategy were to be used.
4.4 "Cognitive Processes in Solving Algebra Word Problems" with Jeffery M. Paige (1966)
This paper compares Bobrow's computer program STUDENT which sets up algebra word problems via direct translation with human subjects asked to set up algebra word problems. The direct translation method of STUDENT is described, and this turns out to model at least one of the subjects quite well on the practice problems (which were noticably simpler than the test problems). Some discussion is given to the differences exhibited by individual subjects setting up the practice problems.
This leads to an interesting linguistic discussion concerning the distinction between conventional naming (e.g., as a variable) and relational naming (e.g., as a function of other objects). Also, there is no simple general rule for deciding when two "names" (noun phrases) refer to the same object. Syntactic similarity can be misleading--as in the "number of quarters" versus the "value of the quarters."
One of the main differences between human subjects and a direct translation mechanism such as STUDENT involves the human subjects' use of "auxiliary representations and cues." Many algebra word problems (including the test problems given in this paper) require some auxiliary knowledge to be possessed and applied by the subject. The authors play a dirty trick on their subjects by having some of the test problems represent impossible situations and noting if the subjects notice this. If the subject performs direct translation, he/she will be less likely to notice than if he/she uses auxiliary information and cues. Some of this auxiliary information can be in the form of physical assumptions (in which the problem can be represented as a diagram).
Several high school students perform the task (poorly) and are classified based on two dimensions, whether they are "good" or "poor" problem solvers (relative to the group!) and whether they use a "physical" or "verbal" translation of a problem which literally represents a physical impossibility but can be naturally converted into forms which are not impossible. The students drew diagrams corresponding to their equations after the experiment and these were compared with their classifications.
Finally, some discussion is given to the use of auxiliary representations to acquire information which is implicit in the representation (though not given explicitly in the problem statement). A mixture problem, in which conservation assumptions are the key to solving the problem, is the prime example.
5.2 with Kenneth Kotvosky, "Empirical Tests of a Theory of Human Acquisition of Concepts for Sequential Patterns (1973)"
This paper concerns the Thurstone Letter Series Completion Tests, especially experiments in which the subjects could view one letter of a series at a time (and this information was recorded, along with a verbal protocol). Computer models were also built to test hypotheses about how subjects induce the patterns.
The hypotheses concern the processes (subjects first discover periodicity, then induces a rule using relations same, next and backward next [predecessor]--experiencing trouble if the subject must keep positions on more than one list), mechanisms (the relations and alphabet are known, a new list ["alphabet"] can be induced, pattern descriptions can be generated based on lists of the relations, the sequence can be extrapolated from the description), individual differences (that minor modifications of the computer program account for such differences), and sources of human error (unfamiliarity with alphabet or relations, spurious relations and inability "to organize and record the relations . . . in a coherent pattern description").
The experimental problems are ranked with respect to difficulty. The first feature noticed is noted, providing an ordering on the relations same, next, and backward next (including broken sequences and periodic relations). The overall fit of the data to the theory is discussed (along with a few errors subjects made which are not accounted for by the theory). There is also some information that some subjects used hierarchic patterns and counting procedures.
5.5 with Glenn Lea, "Problem Solving and Rule Induction (1974)"
The authors demonstrate how the problem of rule induction can be formulated in a framework closely related to that of problem solving. The main difference is that problem solving involves search in one problem space and rule induction involves search in two problem spaces (one for the rules and another for testing instances). The cryptarithmetic task is used as an example of a problem which can be interpreted as either a problem solving task (as it is in Newell and Simon) or as a rule induction task (trying to induce the "rules" associating letters with digits). In the latter interpretation, one space (the instance space) uses the column representation and the other (the rule space) uses a substitution representation.
The notion of a problem space is formalized as a collection of "knowledge states", "generative processes (operators)", "test processes", and "processes for selecting which of these generators and tests to employ." Along these lines, the generate-and-test and heuristic search are briefly described, along with the means-ends and matching submethods. A taxonomy (given by Newell) describes systems which induce rules by using "separate generators for rules and instances and a match process to test whether an instance agrees with (is associated with) a rule." A diagram shows the basic modules of such a system. It is noted that sophisticated systems may use feedback from instance tests to select a new rule (or modify the old rules) and may use feedback from the rule induction to select instances (if this is possible in the experimental setup) to test the new hypothesis. These ideas are used to propose an executive program called GRI (General Rule Induction Executive Program). This is analogous to GPS in the domain of problem solving.
The problems of concept attainment, extrapolation of patterned sequences, and induction of grammars from examples and feedback are shown to be solvable in the GRI framework. Also, it is noted that the Tower of Hanoi can be a rule induction problem if the instructions ask for a rule for transferring disks to solve the problem (especially if the rule is supposed to work for an arbitrary number of disks).
In the end, the authors note that GRI can be used as a framework for many rule induction tasks simply by instantiated the general method with the specifics of the problem.
6.4 with William G. Chase, "Perception in Chess (1973)"
Experiments involving reconstruction of a chessboard configuration were performed with a Master, a Class A, and a beginner level chess player. Some of the experiments were performed with middle-game and some with end-game positions. Some were performed with random configurations (starting with a middle- or end-game position and randomly moving the pieces). The experiments were also partitioned into memory experiments (in which the subject was exposed to the configuration for 5 seconds, then asked to reconstruct it) and perception experiments (in which the subject could refer to the configuration by glancing at it). The point was to determine the nature of the chunks used by the different subjects.
The Master demonstrated the best performance on the tasks using actual chess positions, with the Class A player next and the beginner last. However, all three subjects performed equally poorly on the random chess positions.
The hypotheses are the following:
(1) In the perception task within-glance placement of pieces corresponds to a chunk while between-glance placements represent the boundaries of the chunks. It takes about 2 seconds to recognize a chunk in a glance.
(2) In the memory task, chunks correspond to pieces placed successively within 2 (or it could have been 2.5 with similar results) seconds, while pauses of greater than 2 seconds indicate boundaries between chunks.
The relations among pieces which are examined in this paper are attack, defense, same color, same piece, and proximity. Results are shown which support the general idea that the more relations there are between two pieces, the more likely they are to be in the same chunk (where chunks are determined by the hypotheses above).
It appears that, especially in the middle-game positions, the Master player builds larger chunks. He may also store more chunks in STM (though alternative explanations for this are suggested). Also, a closer examination of the actual chunks reveal most to be of similar types, such as pawn chains, castled-King configurations, first-rank configurations, Rook pair and Rook-Queen configurations, and various attack configurations.
2.4. Simon, H. A. and Zhang, Guojun. "STM Capacity for Chinese Words and Idioms: Chunking and Acoustical Loop Hypotheses," Memory and Cognition, 13: 193-201 (1985).
This paper concerns six experiments with Chinese radicals, characters, and words using Chinese speakers as subjects. The point was to reconcile Miller's "seven+-two chunks" account for the capacity of STM and Baddeley's "two second rehearsal time" account. Every Chinese word they used is made of two characters and each character is made of two radicals. Each radical does not have a prononcable name, although each character has a 1-syllable prononcable name (making each word 2 syllables). Since each radical, character, and word is familiar to the subject, each would count as a chunk.
Experiment 1: In the first experiment, lists of radicals, characters, and words are given, each of increasing length (from 2 to 5). After brief presentations of sequences, the subjects wrote down the sequence from memory. Scores for radicals, characters, and words were each computed by K+0.5N where K is the highest number such that the subject got all the sequences <= length K correct and N is the number of sequences greater than length K the subject got correct. This score should give an idea of the capacity (in chunks) of STM. Results:
Radicals 2.71
Characters 6.38
Words 3.83
An explanation would be that radicals had to be stored in visual STM, while characters and words were stored in auditory STM. The difference between characters and words is explained later.
Experiment 2: Sequences of radicals and homophonic characters were used. Results:
Radicals 3.00
Characters 2.83
Since the characters were homophones, visual STM had to be used, which has a smaller capacity than auditory STM.
Experiment 3: Characters for family names were used. The subjects were told the characters were family names. Some of the characters had no homophonic family names, others had homophonic family names which were used (although most had only two homophones). The results were scored for graphemical as well as phonemic correctness. Results:
Graphemically Phonemically
Correct Correct
Nonhomophones 7.08 7.67
Homophone 5.33 7.92
The fact that the homophone data is higher than the homophone data in experiment 2 is probably due to "the fact that family names seldom have more than two homophones, whereas words have, on average, six." The dominant homophone is most likely to be recalled.
Experiment 4: Homophones were also used here, but were partitioned into first-class (most dominant), second-class (next most dominant), and third-class (those with at least four more dominant). Results:
Graphemically Phonemically
Correct Correct
First-class 5.50 7.25
Second-class 4.08 6.75
Third-class 2.67 5.92
This supports the idea that the most dominant homophone will be recalled (using the data in auditory STM) as well as the idea that visual STM stores about three chunks.
Experiment 5: Here twenty spoken lists of Chinese digits and twenty spoken lists of English digits were recorded onto a tape. These were played to native Chinese speakers who know English as a second language. Results:
Chinese digits 9.50
English digits 5.67
This suggests familiarity of material plays a role in STM capacity.
Experiment 6: These immediate recall experiments used characters, words, and idioms (four characters, hence four syllables). The idea is to reconcile Miller's number of chunks capacity theory and Baddeley's rehearsal time capacity theory.
Results:
Characters: 6.58
Words: 4.58
Idioms: 3.00
It turns out this data can be fit by a curve of the form C = T/[a+b(S-1)] where C is the number of chunks recalled, T is the rehearsal time (2 seconds according to Baddeley), S is the size of each chunk in syllables (assuming constant size, as in this experiment), and a and b are parameters where a represents the amount of time to access the chunk and b represents the amount of time to articulate (in rehearsal) each syllable beyond the first. This reconciles the two theories of STM capacity.
3.2 with Yuichiro Anzai, "The Theory of Learning by Doing," Psychological Review, 86:124-40 (1979).
The authors use the protocol of a subject who learns to solve the 5-disk Tower of Hanoi (in four attempts--the last three of which were all successful) to propose a general model of learning by doing. The subject's protocol was divided into four episodes. In the first, the subject selected moves in a way to avoid repetition and loops (the "selective search strategy"). In Tower of Hanoi, this determines all the moves after the first. The subject made the incorrect first move, which led her down an incorrect path. She eventually notices this and returns to the initial configuration. The second episode begins by making the correct first move (so that the former search strategy will lead to a solution). She also begins to mention subgoals in this episode (the "goal-peg strategy"). After solving the problem, she starts over. At the beginning of the third episode, she experiments with the one-disk, two-disk, three-disk, and four-disk problems to test her understanding. This led to the "recursive subgoal strategy" which solved the problem. The experimenter asked her to solve it a third time. In this fourth and final episode, she begins to notice the higher-level perceptual units of "pyramids" and begins to use the "pyramid subgoal strategy." The protocol suggests that during each episode the subject applies the current strategy, gathers information, uses information gathered in a previous episode, and decides when to terminate an attempt.
The authors first simulate the subject's behavior in each episode with fixed production systems. They then note what information the subject gathered during solution attempts and categorize this knowledge. This leads to an adaptive production system which on odd runs accumulates corresponding knowledge and uses this knowledge to produce a new production system. The new production system is run without learning on even numbered runs. These runs roughly correspond to some of the subject's episodes. Some of the techniques for generating new productions from specific knowledge gathered from a solution attempt look like EBG (they replace constants with variables). New productions are generated to avoid bad moves (corresponding to selective search strategy), setting subgoals when the current state satisfies certain tests and a certain corresponding goal is active (corresponding to goal recursion strategy), chunking moves and perceptual chunking (corresponding to the pyramid subgoal strategy).
In a discussion at the end of the paper, the important role of "knowledge of results" is emphasized. Such knowledge is needed to know which moves are "bad" and should be avoided and which moves are "good" and could be used to suggest subgoals in the future.
3.3. with Jill H. Larkin, "Learning through Growth of Skill in Mental Modeling," Proceedings of the Third Annual Conference of the Cognitive Science Society, Berkeley: Cognitive Science Society, 1981.
The behavior of the ABLE system is compared with the behavior of four human problem solvers who read a section of a physics text dealing with fluid statics and is then asked to solve a problem. The ABLE system is a production system (written in OPS) which is capable of using declarative knowledge ("principles" such as equations which relate certain types of quantities in restricted contexts--so that the conditions of the contexts must also be representable) to solve problems. It is also capable of
(a) Proceduralization: Lifting "principles" (declarative knowledge), after it has been used during problem solving, into a production (i.e., procedural knowledge)
and
(b) Composition: Compiling two proceduralized productions into a single production.
The results of the human learners were partitioned into two "more able learners" and two "less able learners." The "more able learners" read the entire section, read slower, and verbalized more than the "less able learners." In problem solving, the "more able learners" made fewer (and more reasonable) errors, solved the problem without search, and considered principles in an order like ABLE. The "less able learners" made more errors (including applying principles out of context simply because the "types" of the variables matched) and "solved" the problem by performing search in a means-ends analysis style.
The conclusion is that the "more able learners" took the time to form more involved mental models which allowed for sound matching of principles to contexts and allowed proceduralization and composition of principles so that the problem solver would have immediate access to information which can be directly derived from the "proceduralized+composed" principles in an appropriate context.
3.4. Feigenbaum, Edward A. and Simon, H. A. "EPAM-like Models of Recognition and Learning," Cognitive Science, 8:305-336 (1984).
This paper reviews the "state of evidence about the adequacy of EPAM" as theories of human recognition and rote learning. It is partially a response to a paper by Barsalou and Bower (1984) which gave "a lengthy critique of discrimination nets as psychological models." Many of the criticisms in the Barsalou and Bower paper stem from the (untrue) assumption that EPAM's discrimination nets are binary (rather than n-ary) trees. The present paper does not limit itself to discrimination nets however, and discusses both other aspects of EPAM's architecture and also proper methodologies for judging theories (without relying on subjective judgements such as "psychological implausibility").
The architecture of EPAM-III is described, consisting of STM and LTM (with discrimination net "index" and semantic memory "encyclopedia"). EPAM-III learns via familiarization (gradual build-up of images of objects--including adding properties and subobject tokens) and discrimination (expansion of the discrimination net).
Several positive accomplishments of EPAM in simulating human memory are listed, including:
EPAM-I,II:
1. Serial Position Curve
2. Constant fixation time for a single item
3. One-trial learning
4. Forgetting (inducing oscillation and retroactive inhibition)
EPAM-III:
5. Fixation time per chunk of about 8 seconds
6. "Effects of intralist and interlist stimulus and response familiarity"
7. More detailed explanation of one-trial learning
MAPP (PERCEIVER + EPAM-like component):
8. Eye movements of expert recognizing chess patterns
Each of these is discussed in a bit more detail in order to account for what components of EPAM's architecture contribute to these successes.
The authors talk a bit about falsification of theories in both a philosophical sense and as it applies to the evidence against EPAM used by Barsalou and Bower. The authors deal with each of the criticisms of Barsalou and Bower in order. These criticisms are organized by:
1. EPAM's use of negative properties (The n.e.c branch is discussed, and there is also some loose talk about classes and levels of abstraction, but no discussion of how classes are formulated in the first place. In particular, the cat-->marmalade cat-->Cinnamin example is here.)
2. EPAM's sensitivity to those properties which lead to discrimination
3. EPAM's "sensitivity to missing or incorrect properties" (EPAM uses the n.e.c. branch to deal with this possibility.)
4. Multiple knowledge domains (here it is noted that the "number of nodes in an EPAM net grows only linearly with the number of objects discriminated")
5. Seriality (vs. Parallelism)
6. Comparison of EPAM with other systems (such as PANDEMONIUM)
3.5 with Xinming Zhu, "Learning Mathematics from Examples and by Doing (1988)," Cognition and Instruction, 4:137-66 (1987).
This paper concerns learning from examples (i.e., generalizing worked-out examples) and learning by doing (i.e., finding a solution by search, then learning from that solution). Learning by doing is synonymous with discovery learning.
The main experiment discussed involves algebra students in a Chinese middle school learning to factor quadratics (with integer coefficients). The experiment was run both on classroom students and a smaller group of protocol students (students who provided verbal protocols during problem solving). Some of the protocol students learned from examples while others learned by doing. The procedure of the experiment was as follows: pretest, review prerequisite knowledge, second pretest, learning from examples or by doing, and final test.
Before the experiment was performed, a production system was written to perform the factorization task. It was assumed that the students would learn analogous productions, and the experiment was designed so the student could induce the productions from the examples in a reasonable order.
The students were able to learn in this manner (as evidenced by the protocols and demonstrated on the final test) and in a short time. Furthermore, there is evidence that the students learned with understanding instead of performing rote memorization. The evidence for this includes the ability to use the productions to solve problems, the ability to relate factorization to multiplication, the recognition of the sum and product rules, recognition of sign patterns, and the ability to verbalize the factorization procedures. (I am unconvinced that any of these measures "understanding," but I believe I have a different notion of "understanding" in mind. The authors' take "understanding" to be "whether [the subject] can use [knowledge] in appropriate ways.")
Additional experiments are briefly discussed. These include replications of the factorization task using experimental students and control students (learning in a traditional "instructional" environment), exponent and geometry tasks, and ratios and fractions. In each case, the experimental group learned in less time and performed better on the final test (though the difference was not always significant). Another experiment involved retesting factorization and geometry after one year (the purpose being to test retention, which would presumably be longer if the student learned with understanding). The experimental groups performed better on the retest than the control groups.
Finally, a Chinese middle school experimented with a three year curriculum teaching algebra and geometry using these techniques. These students performed slightly higher on standardized tests than those who went through the regular curriculum.
It is noted in the conclusion that these techniques could be used to design computer aided instruction programs.
4.3 with Dorothea P. Simon, "Differences in Solving Physics Problems (1978)."
This chapter concerns an experiment in which two subjects are asked to read a chapter in a high school physics text on uniform acceleration and linear motion. The meat of the chapter is contained in 11 equations (reprinted in Simon's chapter) which relate distance, velocity (constant/initial/final/average), acceleration, and time. One of the subjects (S1) is considered an "expert" while the other (S2) is considered a "novice." Protocols of the two subjects are taken as they solve problems at the end of the chapter.
(S2) takes about four times as long and says about twice as much as (S1). Production systems are constructed which simulate the behavior of the subjects. The productions essentially consists of matching the independent and dependent variables of equations and instantiating the values in the equations to obtain the dependent variable if the conditions are met. The differences in the production systems include a reordering of the equations and the fact that (S1) works in more of a "Forward Search" mode while (S2) works in more of a "Backwards Search" (or "Means-Ends Analysis") mode. That is, (S1) takes the given variables and immediately calculates related variables (without immediate concern for which variable is asked for by the problem) while (S2) only invokes a production when the dependent variable is "wanted" and the independent variables are known. The authors speculate that another difference between the subjects is that (S1) uses more physical intuition than (S2). That is, (S1) translates the problem statement into a structure which represents the physical situation. This leads (S1) to consider equations which have a physical interpretation in his structure (often ignoring equations which do not have such a physical interpretation but would lead to shorter solution paths). When (S1) does use equations which do not have a physical interpretation, he often checks his answer using the equations which do have such a physical interpretation. (S2) does not exhibit similar behaviors.
One problem is singled out for special attention since it involves simultaneously solving two equations with two unknowns. This problem, in particular, demonstrates a weakness of only working backwards as (S2) generally did. (S2) took over eighteen minutes to solve the problem, taking over 10 minutes to evoke an appropriate equation (since the dependent variable in the equation was not "wanted" before this point).
(S2) exhibited some modest learning during the task. In particular, at the beginning she had to frequently refer back to the chapter to remember the equations, but by the end the equations could be recalled from LTM. There is also sketchy evidence that she began to employ physical intuition.
4.8 "Why Are Some Problems Hard? Evidence from the Tower of Hanoi" (1985) with Kenneth Kotovsky and John R. Hayes, from Cognitive Psychology 17:248-294.
This paper examines isomorphs of the three-disk Tower of Hanoi problem (starting five-moves from the goal). These include problems with monsters who move globes according to rules ("Move" problem) and monsters who grow and shrink globes according to rules ("Change" problem), as well as other isomorphs. A number of experiments were performed to examine why some isomorphs are more difficult for subjects to solve than others.
The Rule Learning Hypothesis is tested. This states that the more difficult rules are for a subject to learn, the more difficult the problem will be. Also, the Rule Application Hypothesis is tested. This states that the more difficult a rule is to apply (correctly), the more difficult the problem will be. The rules for the "Change" problem are more difficult than those for the "Move" problem, since the "Change" rules require more imaging. This is reflected in the data, supporting the hypotheses.
Also, experiments show that problems with rules which are consistent with "real world knowledge," the easier they are to solve than others. This is tested by using the Acrobat (larger acrobats cannot be on smaller) and Reverse-Acrobat (smaller acrobats cannot be on larger) problems. Transfer of problem solving knowledge (and its [strong] dependence on rule similarity) is discussed along with experimental results.
The usefulness of external memory aids is tested. The standard Tower of Hanoi problem is much easier than many isomorphs, at least in part because the physical representation makes testing the legality of moves quite easy. Using monsters models (with "hook" hands) and balloons (capable of being inflated and deflated), subjects solve the Move & Change monster problems. Also, some subjects are trained in making the moves before attempting the task. The training consistently improved performance. The models improved performance on the Move problems, but not necessarily on the Change problems. Other tests were performed using "Dish" versions of the problem which varied the extent to which the rules were modeled in the physical representation. Clearly, the more the rules were reflected, the easier the problem became.
It is noted that the subjects' behaviors could be partitioned into an exploratory period and a final path which led to the solution. The reason appears to be (it is argued) that during the exploratory period, subjects learn how to make moves and chunk two moves together. This frees the memory load and allows the subject to plan. Once the subject is able to plan, the solution is readily found (since it only requires two chunked-moves--five total moves).
The performance of subjects is compared with a program designed to do a random walk. This is comparable until the goal is less than three moves away. Once a human subject is three moves away from the goal, she/he generally finds a solution.
In order to test the possibility that the exploratory period merely gets the subject to within three moves of the goal, and then the problem is solved, experiments were performed in which the subjects are given hints from which to induce the first two moves. These hints do not help the subjects find the solution in less time. It appears that the exploratory period is essential for finding the final path to the solution.
Finally, experiments were performed in which subjects solved a problem, then are given the same problem with a different starting point (from which the minimal solution path is disjoint from the previous minimal solution path). The subjects performed much better on the second task.
5.3. with Deepak Kulkarni, "The processes of Scientific Discovery: The Strategy of Experimentation (1988)."
The authors investigate the discovery by Hans Krebs of the Ornithine Cycle by which the liver produces urea from carbon dioxide and ammonia. The historical data is based upon an account by Frederick L. Holmes (1980). This is paraphrased in the article. Essentially, Krebs had a "secret weapon"--namely, using the tissue-slice method he had learned while working at a lab. Most researchers were using the perfusion method (involving an entire isolated liver). Krebs discovery occurred in three stages: the first stage involved exploration until the effect of ornithine was noticed, the second stage involved determining the scope of the ornithine effect (that it was specific to ornithine), and the final stage involved discovering the synthesis method (including that ornithine acts as a catalyst).
This discovery is simulated by a production system named KEKADA. The data types (i.e., types of WME's) are processes, substances, experiments, supplementary facts and hypothesis. The general architecture of the discovery system (which works in both a rule space and an instance space) is given. A diagram shows the interaction between modules which choose problems, generate hypotheses, propose experiments, set expectations, and modify hypotheses based upon results. The notion of "surprise" is built into the system--in that when the results of an experiment are outside an expected range, this is noted and used to propose new hypotheses and new experiments. The productions of KEKADA are described in some detail, partitioning them into the families: problem-choosers, problem-generators, decision-makers, experiment-proposers, expectation-setters, experiments (actually, the user supplies the results of experiments to KEKADA), hypothesis-generators, hypothesis-modifiers, confidence-modifiers, and hypothesis or strategy-choosers. The authors also note what knowledge was given to KEKADA.
A log is presented which shows the sequence of hypotheses formed and experiments proposed by KEKADA along the way toward noticing the orthinine effect, determining the scope of the effect, and determining the reactions involved. User interaction is especially heavy in the last stage (in which orthinine is determined to be a catalyst and the intermediate chemicals arginine and citrulline are determined).
Finally, the generality of the system is discussed. The authors classified 31 of KEKADA's productions as "domain-independent" and 33 as "domain-specific." Many of the heuristics were used in different forms by Lenat's AM. Also, Holmes has given an account of another discovery by Krebs (glutamine synthesis). The authors claim that a hand simulation shows that KEKADA should be able to handle this discovery in a similar way to the orthinine cycle discovery.
6.3. with Jill H. Larkin, "Why a Diagram Is (Sometimes) Worth Ten Thousand Words," Cognitive Science, 11:65-99 (1987).
This paper compares sentential with diagrammatic representations on several problems. Two representations are considered informationally equivalent if the same information can be inferred from both, and computationally equivalent if any inference which can be drawn quickly and easily in one can be drawn quickly and easily in the other. Then the point is made that the computational efficiency of a representation depends on the operations available to make inferences, not the notational details.
Programs are partitioned into (1) search, (2) recognition, and (3) inference. In production systems, (1) search corresponds to the number of WME's which must be checked to match conditions, (2) recognition corresponds to the cost of matching a WME to a condition, and (3) corresponds to the action performed by a matched production. The authors claim that diagrams (1) reduce search by localizing information and (2) enhance recognition by making certain relationships explicit (such as the intersection of two lines), but (3) do not affect inference if the information content of the two sets of inference rules is equivalent.
The first example is a physics problem dealing with pulleys. The problem is solved using a sentential representation, then with a diagrammatic representation. The key difference is with respect to search. The diagrammatic representation includes an attentional mechanism which focuses attention on the part of the diagram about which information has just been inferred. In the end, a subsection on "Computational Power in Inference Rules" discusses the notion of using higher-level inference rules instead of lower-level ones from which the higher-level ones could be derived. This discussion seems very relevant to my more general efforts.
The second example is a geometry example involving transversals of parallel lines and congruent triangles. Two significant problems here are that (1) many objects referred to in the problem are not explicitly introduced but are perceptually obvious from the diagram and (2) some higher-level concepts (such as "alternate interior angles") can be defined in terms of perceptual entities, but may not be perceptually obvious themselves. The first problem is overcome in the sentential representation by taking the information given in the original problem and inferring other elements by perceptual productions. The second problem is overcome by writing the productions completely in terms of the lower-level, perceptual elements (using the definitions of the higher-level concepts). Of course, this leads to productions with many conditions to be matched. In the sentential representation, the WME's must be searched linearly to match each of these conditions. In the diagrammatic representation, due to the localization of information, the number of searches is related to the number of locations referenced by the conditions instead of the number of conditions. In this manner, the diagrammatic representation reduces search. More importantly, the diagrammatic representation enhances recognition, since instead of explicitly deriving all the perceptual elements as in the sentential representation, this information is already represented in the diagrammatic representation (though the details of this representation are not given).
Finally, some "artificial" examples of diagrams are briefly considered from economics (supply and demand curves) and physics.