Javascript Interactive Higher-Order Theorem Prover

Automated Reasoning in Higher Order Logic

Automated Reasoning in Higher Order Logic

Math Gate

Scunak

This page was created and is maintained by Chad E Brown.

Pre 1999 Notes on Machine Learning

I. Deductive Learning

A. Explanation-Based Generalization

Tom M. Mitchell, Richard M. Keller, Smadar T. Kedar-Cabelli, "Explanation-Based Generalization: A Unifying View," Machine Learning 1:1 (1986).

This paper introduces a framework for understanding EBG which partitions the method into:

Goal Concept: Some subset of our "universe" of which we have a characterization in terms of our domain theory.

Training Example: An example of the goal concept.

Domain Theory: The rules corresponding to the domain knowledge which is used to construct "explanations" (or proofs) that an example is a member of the goal concept.

Operationality Criterion: A specification of the form that a learned concept definition must take.

They demonstrate this with some specific examples including a SAFE-TO-STACK example and an example which learns structural criteria for recognizing cups from a functional definition of cups. Both of these examples essentially learn at the "object" level with respect to the logic (which appears to be prolog-style). They also demonstrates EBG with the example of the LEX2 system which learns which operators are useful for solving integrals. This example learns at the "meta" level with respect to the logic.

They also discuss possible combinations of deductive (EBG) and inductive (which they call similarity-based) learning.

M. R. Donat, L. A. Wallen, "Learning and Applying Generalised Solutions using Higher Order Resolution," CADE 9 (1988).

This paper describes generalizations of prolog-style derivations in which the terms are l-terms and unification is higher-order unification. It then describes how higher-order unification can be used to apply the rules extracted from these generalized derivations. The system has been implemented in Paulson's Isabelle system.

One problem noted is that trivial substitutions (substitutions for "function type" variables which ignore their arguments) can result in "useless applications" of generalized rules. That is, the instantiated rule may contain the same l-term in the conclusion and as a member of the premise (along with other premises). Such an application of a rule is clearly useless since it does not reduce the problem to be solved. To solve this problem, they propose to regulate higher-order unification using "filter expressions". A "filter expressions" gives Boolean constraints on which variables are allowed trivial substitutions (for example, if F and G are variables, (OR F G) is a constraint that demands either F or G have a nontrivial substitution).

Scott Dietzen, Frank Pfenning, "Higher-Order and Modal Logic as a Framework for Explanation-Based Generalization," Machine Learning 9:23-55 (1992).

This paper describes lProlog and an implementation of a language loProlog in lProlog. loProlog incorporates the modal o operator which is used to distinguish between the domain theory and the training theory when building a generalization. They give examples of EBG in the higher order context (namely generalizing an integration). Then they describe the implementation of tactics ("named rules") & tacticals ("meta-tactics") in lProlog and how tacticals may be generalized. (They also discuss how the o operator differs from operationality criteria.)

These ideas are also applied to program transformation in an example in which a representation of a "factorial" algorithm is transformation into a tail-recursive form. A generalization of this transformation procedure also applies to a representation of a "reverse" algorithm, so that the generalization puts this algorithm into a tail-recursive form as well. They also discuss, under the heading of "Apprentice Learning", why EBG may be considered more than speed-up learning (since it addresses the problem of intractable search).

Finally, lProlog code is given which implements loProlog with EBG. (The generalization of the derivation is constructed in parallel with the solution.) Because of the modal o operator, both a weak and strong notion of "solve" and "match" must be implemented. In order to implement EBG, they also implement a "boxed-match" and "meta" versions of weak and strong "solve".

Bernard Silver, "Learning Equation Solving Methods From Examples" (1983)

This paper describes a PROLOG program, LP, which learns to solve symbolic equations. The program is given worked examples, from which it can extract new rules, control information (preconditions and postconditions of rules), and meta-control information (plans/schemata for applying methods in a certain order). Plans for solving equations are formulated by computing a "characteristic tuple" (CT) of meta-level characteristics (number of unknown occurrences, etc.) and trying to apply a schema which is judged to be "relevant" to the equation because it was generated from an example with the same CT. Plans are flexibly applied so that if a particular method in the schema does not apply to the problem, another method which results in the desired postconditions may be substituted.

Roberto Desimone, "Learning Control Knowledge: An EBL Approach" (1988)

This paper describes methods for using EBL to learn control knowledge in the following three forms:

1. Applicability of plans and operators

2. Contributions of operators within plans

3. Explicit structure of plans

He classifies existing systems (MACROPS, LEX2, LP, and PET) which learn knowledge related to these three areas.

Precondition analysis (PA) is a technique which has proven useful for the domain of algebraic equation solving. However, he points out problems with PA and suggests solutions.

II. Inductive Learning

A. Decision Trees

Quinlan, "Induction of Decision Trees," Machine Learning 1:81-106 (1986).

This paper describes an algorithm ID3 which constructs a decision tree which splits on each (discrete) value of a given feature. It first describes the induction task. The paper then gives information theoretic methods (namely, the information gain) for choosing which feature to use at the root of each (sub)tree given the appropriate training instances. A method for dealing with noise (the C2 statistical test for independence) and a method for handling missing attributes (a modification of the information gain calculation) are given. One problem discussed is that the selection criteria favors features with several possible values. A possible solution, using the gain ratio, is given.

Mark W. Craven, Jude W. Shavlik, "Extracting Tree-Structured Representations of Trained Networks," Advances in Neural Information Processing Systems, Vol. 8 (1996).

This paper describes a method for constructing a decision-tree from training examples and an oracle which can be used to answer queries (classify new examples). The goal is to construct a simple tree (which is understandable to humans) which maintains a high level of fidelity to the oracle. In the paper the oracle is a neural network trained on the training examples in advance. The constructed decision trees use m-of-n style tests to determine binary splits.

B. Neural Networks

Nillson, "Chapter 4: Some Nonparametric Training Methods for F Machines"

This paper discusses methods for training a TLU (Threshold Logic Unit) to fit training examples (which must be separable by a hyperplane). It describes the solution region of the weight space (dual to the feature space) and gives methods for moving in the weight space (updating the weights) in order to guarantee convergence to the solution region.

Handout from John Hertz, Anders Krogh, Richard G. Palmer. Introduction to the Theory of Neural Computation

The portions of the book in the handout describe the gradient descent learning method for linear units and then generalizes this to obtain the back-propagation algorithm for a multilayer network of nonlinear units. Back-propagation is a method of error correction which works for feed-forward networks. Essentially, an input training pattern propagates signals forward to the output nodes, then the error is calculated for the output nodes (d's). A calculus argument justifies calculating the error (d's) for hidden nodes from the nodes one level higher in the network. Hence, we back-propagate the d's to calculate the d's for the entire network. Then the d's are used to update the weights.

Mark W. Craven, Jude W. Shavlik, "Extracting Tree-Structured Representations of Trained Networks," Advances in Neural Information Processing Systems, Vol. 8 (1996).

This paper presents an algorithm for extracting decision trees from a given neural network. (See Decision Trees)

C. Theory, etc.

"Chapter 2: How to Estimate the True Performance of a Learning System"

This chapter describes methods for estimating the true error rate of a classifier. Testing a classifier on training data gives the apparent error rate, which tends to be optimistic. Methods for estimating the true error rate include train-and-test (splitting the instances into a training set and a testing set), random subsampling (averages over several train-and-test partitions), cross-validation (partitions instances into k sets and trains each possible k-1 and tests on the other, averaging over the k iterations), leaving-one-out (special case of cross-validation in which k=number of instances), and bootstrapping (samples instances with replacement).

J. R. Quinlan and R. M. Cameron-James, "Oversearching and Layered Search in Empirical Learning"

This paper introduces layered search as a technique for avoiding "fluke" theories. A "fluke" theory is a hypothesis which is a good generalization of the data (with respect to the bias of the system) but classifies new data poorly. The more extensive the search through the hypothesis space, the more likely the system is to find such a "fluke" theory. At the same time, a less extensive search may fail to find a good generalization of the data. This is demonstrated in the paper by noting the performance of beam searches of various beam widths on different domains. In many domains, the beam width which yields a generalization with the lowest true error rate corresponds to a compromise between a greedy search and an extensive search. A layered search performs several beam searches with increasing beam widths and uses a probabilistic argument to choose the generalization obtained in one of those searches.

Rich Caruana, "Algorithms and Applications for Multitask Learning"

This paper discusses the idea that an algorithm can often better learn a task if it learns several other, related tasks along with the main task. It is well-known that inductive learning methods must employ a bias in order to generalize from training examples and that no bias can perform well in all domains. This method of learning auxiliary concepts along with the target concept biases the system in a domain-specific manner.

An important theoretical question which must be considered when applying multitask learning is the following: when are two tasks related? More specifically, when do two concepts share some relationship which allows us to learn the two concepts better in parallel than separately? (After examining this question for two concepts, the extension to several concepts should be straightforward.) One obvious criteria is that both the concepts should depend upon the features selected as inputs. Also, two concepts are obviously related if one of the concepts depends directly on the other concept (for example, if one subsumes the other). In such a case, we could presumably learn one concept more easily given the other as an input feature. However, in Section 2.9 of the paper, he notes that it is possible for a concept to be less useful as an input feature than as an auxiliary output to be learned. This suggests that more general relationships (than direct dependency) between concepts allow them to be learned more accurately in parallel. A deeper investigation of these relationships should be beneficial since with a better understanding of these relationships in theory, methods could be derived to select auxiliary learning tasks with some degree of confidence that the auxiliary concepts are related to the main concept being learned.

Leo Breiman, "Bagging Predictors"

This paper discusses a method of building a predictor by "bagging" (bootstrap aggregating). This method bootstraps on the training instances (ie, forms replicate data sets by sampling the training instances with replacement) several times to build several corresponding predictors. Then these predictors are combined into one classifier (for example, by taking "majority vote" for classifiers or by "averaging" for numerical functions). This process is most helpful for "unstable" procedures (those for which replication of data causes large changes in the predictor--for example, neural nets, classification and regression trees), but can degrade performance for stable procedures (such as k-nearest neighbor classifiers). Empirical results are given for classification and numerical problems. Also, some theoretical arguments are given to show why bagging works.

Yoav Freund, Robert E. Schapire, "Experiments with a New Boosting Algorithm"

This paper introduces AdaBoost, a boosting algorithm which takes training examples and a (weak) learning algorithm and produces a classifier by combining several classifiers generated iteratively. Boosting is a modification of bagging which begins by generating a classifier and then on each iteration generates a new classifier which concentrates on learning "hard" examples. Boosting also differs from bagging in that the classifiers are weighted by accuracy when combined to give the final classifier.

One problem with this technique is that by concentrating on "hard to learn" examples, the algorithm may often concentrate on noisy examples. While many experimental comparisons were given in the paper, there were no empirical results given to show the effect of increasing levels of noise in a domain. Bagging, by using random resampling with replacement, would seem to avoid the problem of concentrating on noise. A comparison of bagging and boosting with various levels of noise would have been interesting.

A point of interest in the paper is the comparison between using "error" estimates (which ignore how the example was misclassified) and "pseudo-loss" estimates (which take the misclassification into account). Of course, these ideas coincide unless the classification problem has more than two categories. In the case of a classification problem with more than two categories, the empirical results shown in Figure 3 of the paper clearly show that "pseudo-loss" estimates outperforms "error" estimates. This suggests that paying attention to the nature of a misclassification can lead to more accurate learning.

Kamal M. Ali, Michael J. Pazzani, "Error Reduction through Learning Multiple Descriptions"

This paper discusses learning multiple classifiers for data and combining their results to classify test cases. The classifiers learned in this paper are represented by clauses (with classification at the "head" [in the Horn sense] of the clause), but the arguments appear to apply to other representations as well. To learn several classifiers, instead of choosing the "attribute test" with the best information gain, they create a bucket of "attribute tests" which appear to give a significant information gain and branch possible classifiers by choosing some tests from the bucket. These tests (in the bucket) are called "gain ties." They combine the learned classifiers using four different methods: 1) Uniform Voting, 2) Bayesian Combination, 3) Distribution Summation, and 4) Likelihood Combination.

They show empirically that combining multiple classifiers leads to lower error rates. Also, they argue theoretically and give emperical evidence that lower error rates are dependent upon the correlation of classification errors among the multiple classifiers (the less correlation of errors, the greater the error reduction). Also, they argue that a higher average of gain ties leads to a greater error reduction. The method of combining multiple classifiers does not seem to help with the problem of class noise, but does seem to help with the problem of irrelevant attributes (though not in the limit). Earlier work had theorized that "syntactic diversity" of the multiple classifiers led to greater error reduction, but here they argue that "syntactic diversity" is not enough--what is needed is diversity plus accuracy. In other words, diverse but inaccurate classifiers combine to give inaccurate classifiers.

Ron Kohavi, Mehran Sahami, "Error-Based and Entropy-Based Discretization of Continuous Features"

This paper compares methods of discretizing continuous features by choosing the number and locations of thresholds. The two main approaches used for making these choices are entropy-based (both with and without pruning) and error-based. One limitation of the error-based approach is that it ignores possible interaction between features. That is, by concentrating on minimizing the error of predicting class value given the single feature, the feature values will never be partitioned in such a way that two partitions which coincide predict the same class value. The authors demonstrate in a contrived example why this may be necessary (because of interaction between features). The empirical results given suggest that combining an entropy-based approach with a minimum description length principle stopping condition performs best on the datasets tested.

D. Lazy Learning

David W. Aha, Dennis Kibler. "Noise-Tolerant Instance-Based Learning Algorithms"

This paper first describes instance-based (or, case-based) learning algorithms, which learn a concept by storing selected training instances (using a memory updating algorithm) and classifying test cases using a similarity metric and a classification function (such as nearest neighbor or k-nearest neighbor algorithm). The paper then describes extensions of the algorithms which are noise-tolerant. The extensions keep records of how useful a saved instance has been at classifying later instances and uses statistical tests to decide when to accept an instance or reject an instance as noisy.

Jerome H. Friedman, Ron Kohavi, Yeogirl Yun, "Lazy Decision Trees"

This paper describes an inductive learning procedure which performs classification in the style of a decision tree but is a lazy, incremental learner. The procedure is actually instance-based since the classification of an instance is based upon that instance and saved training instances. However, instead of using a nearest neighbor (or k-nearest neighbor) algorithm, the training instances are used along with the unclassified instance to determine which feature of the instance to test. It then partitions the training instances according to that test, until a majority classification is appropriate.

E. Learning with Queries

Mark W. Craven and Jude W. Shavlik, "Extracting Tree-Structured Representations of Trained Networks"

This paper describes an algorithm, TREPAN, which employs training examples and an oracle to guide the induction of a decision-tree classifier. In particular, the oracle of interest in this paper is a neural network trained on the training examples. As a result, the algorithm constructs a decision tree which approximates the neural network. The decision tree constructed in this manner will generally be more comprehensible to a human than the neural network.

Of particular interest is that TREPAN uses the neural network as a "black box" which allows the algorithm to learn tree structures to approximate any classifier (such as a nearest-neighbor classifier). He notes this in his conclusion but unfortunately gives no empirical results. An interesting evaluation of the algorithm might be to build decision-trees for a variety of classifier representations and compare the relative fidelity to each. Such tests would illuminate which representations are easier to approximate by decision-trees and which are harder.

An interesting application of this algorithm might be to induce efficient approximations (as decision-trees) to more complex classifiers. That is, we may be given a classification algorithm A (without any assumption of its origin) which is slow or requires significant computational resources. TREPAN could be given a randomly generated initial set of training examples (classified by A) and use A as the oracle. Then TREPAN would generate a decision-tree which might provide a faster, more efficient, classifier which approximates A. This assumes we would be willing to sacrifice some accuracy for the sake of speed or efficiency.

Reinforcement Learning

A. Samuel, "Some Studies in Machine Learning Using the Game of Checkers" (1959)

This paper describes a checker-playing program which is used to demonstrate rote-learning and learning-by-generalization (using evaluation polynomials). Rote-learning uses a minimax look-ahead search with a fixed evaluation polynomial to evaluate board positions and then saves these board positions (with clever indexing) with their evaluations if they are thought to be useful. He provides for methods to "forget" board positions which have not been used (hence "refreshed") in enough time. Learning-by-generalization instead of memorizing board positions, changes the coefficients in the evaluation polynomial to try to better evaluate board positions (after abstracting a board position into characteristic values). He compares the two methods and notes that rote-learning appears to perform better on the opening (by memorizing the openings of master players) and on the end-game (by memorizing "traps"). However, learning-by-generalization performs better in the middle-game, presumably because there are so many permutations of board configurations that rote-learning provides a poorer approximation to fitness than the evaluation polynomial.

Gerald Tesauro, "Practical Issues in Temporal Difference Learning"

This paper investigates the application of temporal difference learning to problems with delayed reinforcement. The author concentrates on the temporal difference learning algorithm TD(l), where l is a parameter (with values between 0 and 1) which determines the dependence of the training reward for a particular state in the training sequence on the next state in the training sequence versus the final reward of the sequence. The author discusses a number of practical issues and notes that theoretical considerations have not provided sufficient answers to the questions these issues raise. However, in the example application, that of a backgammon program which learns using TD(l), the author notes that the performance is better than theoretical considerations may suggest.

The author often compares the TD-trained backgammon playing algorithm to an EP-trained (expert player) backgammon playing algorithm. While the EP-trained player tends to outperform the TD-trained player on the test set (see figures 2 & 3), the author notes that the test set may reflect the expert's bias more than game-playing ability. This appears to be the case. The author pits the TD-trained and EP-trained algorithms against the conventional commercial backgammon playing program Gammontool. The TD-trained player outperforms the EP-trained player in this test, if the neural networks used have a sufficient number of hidden units. This suggests that for reinforcement learning problems, temporal difference learning may be more effective than recasting the problem as a supervised learning problem in which training examples have been labelled by an expert.

Robert H. Crites, Andrew G. Barto, "Improving Elevator Performance Using Reinforcement Learning"

In this paper, reinforcement learning is applied to the problem of controlling elevators. This problem is different from other problems to which reinforcement learning has previously been applied (such as games) since the elevator problem operates in continuous time instead of discrete time steps. The problem is further complicated in that each elevator learns its own control strategy while the system as a whole has one reinforcement signal (attempting to minimize the sum of the squares of waiting times). Also, the control of the elevator is constrained to making the decision of whether to stop on a floor or continue. Assumptions are made which control any other decision (such as whether to change direction).

In spite of these complications to the problem, the neural networks after training (via reinforcement learning) used to control the elevators still outperform other standard heuristics for elevator control. Since the reinforcement signal was designed to minimize the sum of the squares of waiting time, the fact that the learned controllers outperformed the heuristics with respect to the sum-of-squares measure should not be too surprising. More surprising is the fact that the learned controllers outperformed the heuristics with respect to other measures including the length of the average wait, the average time spent in the system, and the percentage of clients who had to wait over 60 seconds. This remained true whether all traffic was down or whether some "up" traffic from the lobby was included. Based on the comparison tables, the only exception is the heuristic FIM (Finite Intervisit Minimization) which outperforms the learned controllers with respect to the percentage of clients who had to wait over 60 seconds when the "up" traffic from the lobby was an average of two per minute. It would be interesting if the authors could account for this anomaly.

Thomas G. Dietterich, Nicholas S. Flann, "Explanation-Based Learning and Reinforcement Learning: A Unified View."

This paper compares algorithms which learn improved policies for applying operators to reach goal states. Two of the algorithms use Reinforcement Learning, one uses an Explanation-Based approach, and two combine elements of Reinforcement Learning with elements of Explanation-Based Learning. An advantage of the Explanation-Based approach which is learns about sets of states rather than single states. In the "maze" examples, these sets of states correspond to rectangles which are preimages of an operator and a state (or sets of states). This is especially useful in domains which have too many states to learn about explicitly. An advantage of reinforcement learning is that it learns values associated with states. These values are used to determine the optimal policy. He combines these techniques in two ways which preserves these advantages.

The algorithm rectangle-based offline dynamic programming

(RECT-DP) is essentially reinforcement learning, except that instead of propagating reward values backwards along the particular path of states that led to the reward, it propagates the reward across the preimages of the sequences of operators which led to the reward. In this way, the algorithm learns the value function on a set of states. The algorithm Explanation-Based Reinforcement Learning (EBRL) is essentially Explanation-Based Learning (EBL), except that it uses the policy learned by EBL to construct a value function for the states covered by the policy. In essence, the authors have taken each of the two approaches and attempted to incorporate some useful aspect of the other approach.

Constructing and Selecting Features

Steve Donoho, Larry Rendell, "Constructive Induction Using Fragmentary Knowledge"

This paper discusses different types of fragmentary domain knowledge and how each can be used to support for or against the relevance of constructed features. This support can help direct the search for constructed features which aid prediction. The main example involves constructing features to help predict the bankruptcy of a corporation. The types of domain knowledge used to restrict which constructed features are considered are dimensional analysis (only considering "dollars," "dollars/year," "years," and unitless features), correlation (positive and negative), and normalization. The resulting constructed features outperformed features constructed by experts. This paper provides an important link between knowledge-intensive domain-specific deductive techniques and search-intensive domain-independent inductive techniques.

George H. John, Ron Kohavi, Karl Pfleger, "Irrelevant Features and the Subset Selection Problem."

This paper examines previous notions of the "relevance" of features to a concept and gives an example which shows these notions to be inadequate. The example, "Correlated XOR," is the boolean concept X1 xor X2 with 5 features X1,...,X5 where X4 = not X2 and X5 = not X3. Clearly, X3 and X5 are irrelevant and X1 is indispensable. However, since the concept can be described by either X1 xor X2 or X1 equiv X4, we can dispense with one of X2 and X4 as long as we have the other. This motivates his definitions of the "weak" and "strong" relevance of features. Under his definitions, X1 is strongly relevant (since knowing all the feature values including X1 gives a different probability of an outcome than knowning all feature values except X1), while X2 and X4 are weakly relevant (since it is possible to know some feature values where the probability of an outcome with knowledge of X2[X4] is different from the lack of knowledge), and X3 and X5 are irrelevant. The goal of his feature selection system is to select all strongly relevant features, no irrelevant features, and some weakly relevant features which provide better predictive accuracy.

Previous feature selection systems have used a "filter" model which select which features will be used before applying the inductive algorithm. The authors use a "wrapper" model which uses the inductive algorithm as a block box, along with cross-validation on the training data, to select features. This feature selection can be done in a forward manner (starting with no features and adding those which prove helpful), in a backward manner (starting with all features and eliminating those which do not prove helpful), or in a combination. Once the subset of features has been selected, the induction algorithm is run on the training data (with only the selected features) and tested on the test data. Emperical results comparing this with ID3 and C4.5 are given.

Bayesian Learning

Pedro Domingos, Michael Pazzani, "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier."

This paper demonstrates that the simple (or naive) Bayesian classifier (SBC) is optimal in some cases when the attributes are not independent (given the class). It also argues that the SBC is nearly optimal in many cases in which the independence assumption is heavily violated. This is due to the fact that the independence assumption is only used to estimate the probability of observing an example given the class P(E|C). However, the SBC actually decides the class by maximizing P(Ci|E) with respect to all possible class values Ci. Even if the attributes are not independent (given the class), so that the estimates of P(E|Ci) and hence P(Ci|E) are flawed, the same Ci may still maximize P(Ci|E). That is, the region of the instance space on which the maximum of the true values of P(Ci|E) and the estimated values of P(Ci|E) agree may be large even when the estimates are poor.

The authors also give some necessary conditions for the SBC to be globally optimal (Theorems 3 and 4) and sufficient conditions for the SBC to be globally optimal (Theorems 5, 6, and 7). In Theorems 5 and 6 the authors prove that SBC is globally optimal for learning conjunctive and disjunctive concepts, even though these concepts do not force attributes to be independent (given the class value).