Given the exuberance surrounding machine learning (ML), and deep learning (DL) in particular, the claims I will make in this short article will be quite controversial, to say the least. Before we commence, however, we need to clarify what we mean by learning, and to do so we first need to discuss our use of the phrase ‘I know’. Our folk use of ‘I know’ in everyday linguistic communication is so overloaded that it has misguided even the most learned amongst us, causing us to ignore the critical difference between its two major uses. For example, here are some common uses of the phrase ‘I know’ in ordinary linguistic communication:
(1a) I know how to play guitar.
(1b) I know how to ride a bicycle.
(1c) I know how to make cheesecake.
(1d) I know how to fix a TV set.
(2a) I know that water boils at 100 degrees Celsius.
(2b) I know that the circumference of a circle is 2πr.
(2c) I know that there are two moons that orbit Mars.
(2d) I know that the larger-than relation is transitive.
Note, first, that ‘I know’ in the set of examples in (1) is followed by ‘how’, while in the set of examples in (2) it is followed by ‘that’. In other words, the first set makes claims about ‘knowing how’ while the second makes claims about ‘knowing that’ (see Roland (1958) and Adams (2009) for a discussion of the difference). Moreover, when we say ‘I know how …’ we are saying that we know some skill (or, more technically, that we have learned some skill). In the examples in (2), on the other hand, we are not referring to any skill that we have learned, but to some fact that we have come to know. The difference between the two sets of examples is substantial, notwithstanding our use of the common phrase ‘I know’. For one thing, when Jon and Sara learn how to play guitar, they do so differently (and at varying speed), and the degree to which their skill in playing guitar develops will always vary. However, when Jon and Sara learn the fact that water boils at 100 degrees Celsius, they do so equally, and with the same certainty. In the former case their knowledge (of a skill) is a function of their experience and multi-modal sensory machinery, both of which are individual-specific, while in the latter their individual experience is irrelevant: they are simply told (instructed) about a certain fact. Another major difference between the examples in (1) and those in (2) is that whether Jon and Sara learned how to play guitar is not relevant to how the world we live in works; the universe could not care less whether I learned how to make cheesecake. But the universe we live in would not be the same if the ‘larger-than’ relation were not transitive, or if the circumference of a circle were not 2πr.
Thus another major difference between knowing-how and knowing-that is that the former is insignificant, individual, and relative, while the latter is significant, universal, and absolute. Some of the main differences between knowing-how and knowing-that are summarized in table 1 below.
Table 1. Some of the main differences between knowing-how and knowing-that

  knowing-how                               knowing-that
  a skill (e.g., playing guitar)            a fact (e.g., water boils at 100 degrees Celsius)
  acquired differently by individuals       known equally, with the same certainty
  a function of individual experience       independent of individual experience
  insignificant, individual, relative       significant, universal, absolute
If the difference between knowing-how and knowing-that is now somewhat clear(er), where does ML, as it is practiced today, fit in? In short, ML is relevant only to ‘knowing-how’, not to knowing-that. That is, ML as it is practiced today is about learning from data (observations, experiences, etc.), and thus it is neither concerned with nor can it arrive at ‘factual’ information, that is, information that is absolutely true. The implications of this are substantial, but two major implications must be highlighted, each of which precludes ML, as it is practiced today, from being the right path to AI: (i) factual information is not learnable from data; and (ii) an agent that functions based on a set of universally valid templates must perform reasoning using strongly typed symbolic structures. We discuss these next.
2. Learning, Error Rate, and Size of the Training Set
We stated above that factual information (e.g., that ‘the circumference of a circle is 2πr’ or that ‘the larger-than relation is transitive’) is not learned, because we are not allowed to learn it differently. But let us imagine for a moment a scenario where humans can, and equally so, learn these facts. In this fictional universe humans would have to undergo exactly the same individual experiences. The only way this can happen is if all humans experience everything that can be experienced, which amounts to observing and experiencing an infinite number of experiences. This is the only way for humans to ‘learn’ from observation the same set of factual (absolutely true) statements. This, of course, is impossible.
Figure 1. Learning with zero error (facts!) requires an infinite number of examples, which of course is impossible.
In fact, learnability theory itself tells us that learning with zero error (as absolutely true facts must be learned) requires an infinite number of examples, as figure 1 above illustrates: to approach certainty (zero error), we would need an infinite number of examples. Clearly, this is impossible, and thus factual information is not learned from experience (from data) but is discovered or deduced.
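This point can be made concrete with the standard sample-complexity bound from PAC learning theory. The sketch below is illustrative (the function name and the particular numbers are mine, not from the article): for a finite hypothesis class H in the realizable setting, roughly (1/ε)(ln|H| + ln(1/δ)) examples suffice to learn with error at most ε, and this quantity grows without bound as the tolerated error ε approaches zero.

```python
import math

def pac_sample_bound(eps, delta, hypothesis_count):
    """Classic PAC bound for a finite, realizable hypothesis class:
    m >= (1/eps) * (ln|H| + ln(1/delta)) examples suffice to learn
    with error at most eps, with probability at least 1 - delta."""
    return math.ceil((1.0 / eps) *
                     (math.log(hypothesis_count) + math.log(1.0 / delta)))

# As the tolerated error eps shrinks toward zero, the number of required
# examples diverges -- learning with exactly zero error would need
# infinitely many examples.
for eps in (0.1, 0.01, 0.001, 0.0001):
    print(eps, pac_sample_bound(eps, delta=0.05, hypothesis_count=1000))
```

The bound makes the article's point quantitative: the demand for certainty (ε = 0) is precisely the limit in which no finite sample suffices.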
3. Examples of Universally Valid Templates that are not Learnable
Thus far we have argued that ML as it is practiced today, that is, as a procedure that learns from data (observations, experiences), is only applicable to finding patterns in data and to inconsequential skills, not to the type of knowledge an intelligent agent would require to function in dynamic and uncertain environments. But then what type of universally valid, unlearnable facts would an intelligent agent require? I will mention a couple of them here, and I hope this will make it clear to the reader what the rest of these universally valid templates might look like.
the template: the container template
structure: an object x, a container object y
associated logic: location(x) = location(y)
                  weight(y) = weight(y - x) + weight(x)
                  size(x) <= size(y)
                  containedIn(x,y) or not containedIn(x,y)
Note that the container template is also similar to one of George Lakoff’s (1990) idealized cognitive models (ICMs). This template underlies quite a bit of our metaphorical constructions and is one of the basic templates used by children early in life. Children know, without ever being instructed, that if they place their red ball in some bag, and the bag is put in their room, then their red ball is now in their room.
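The container template above can be sketched in code. The following is a minimal illustration (the class and attribute names are my own, chosen for readability), with each piece of the template's associated logic noted as a comment next to the line that enforces it:

```python
class Thing:
    """An object x with a location, weight, and size."""
    def __init__(self, name, location, weight, size):
        self.name, self.location = name, location
        self.weight, self.size = weight, size

class Container(Thing):
    """A container object y that holds other objects."""
    def __init__(self, name, location, weight, size):
        super().__init__(name, location, weight, size)
        self.contents = []

    def put(self, x):
        assert x.size <= self.size   # size(x) <= size(y)
        self.contents.append(x)
        x.location = self.location   # location(x) = location(y)

    def move_to(self, new_location):
        # Moving the container moves everything contained in it.
        self.location = new_location
        for x in self.contents:
            x.location = new_location

    def total_weight(self):
        # weight(y) = weight(y - x) + weight(x): the container's own
        # weight plus the weight of its contents.
        return self.weight + sum(x.weight for x in self.contents)

# The child's inference: the red ball goes in the bag, the bag goes
# in the room, so the red ball is now in the room.
ball = Thing("red ball", "hallway", weight=1, size=1)
bag = Container("bag", "hallway", weight=2, size=5)
bag.put(ball)
bag.move_to("room")
print(ball.location)  # prints "room"
```

Note that the inference at the end is never stated anywhere in the data; it follows from the template's logic alone.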
the template: the transitive Xer-than template (X = ‘tall’, ‘big’, ‘small’, ‘heavy’, …)
structure: objects x, y, and z
associated logic: Xer(x,y) & Xer(y,z) => Xer(x,z)
                  Xer(x,y) => not Xer(y,x)
This template describes the logic of many transitive relations, such as ‘larger-than’. Understanding this universally valid cognitive template is essential to many basic cognitive tasks, as well as to fully understanding implicit information conveyed in language. Again, children function using this template at a very young age, without ever being instructed about its logic.
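As a minimal sketch of this template's logic (the helper function below is my own illustration, not something from the article), one can compute the transitive closure of a few observed ‘taller-than’ facts and read off the unstated ones:

```python
def transitive_closure(pairs):
    """Apply Xer(x,y) & Xer(y,z) => Xer(x,z) until nothing new is added."""
    rel = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(rel):
            for (c, d) in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return rel

# From 'Jon is taller than Sara' and 'Sara is taller than Max' alone,
# the template licenses the unstated fact 'Jon is taller than Max' ...
taller = transitive_closure({("Jon", "Sara"), ("Sara", "Max")})
print(("Jon", "Max") in taller)  # prints True

# ... while the second rule, Xer(x,y) => not Xer(y,x), rules out
# the converse of every derived fact.
assert all((b, a) not in taller for (a, b) in taller)
```

The point of the template, of course, is that the closure rule itself is not induced from the observed pairs; it is what licenses conclusions the data never contains.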
4. Accounting for Universally Valid Templates Requires Symbolic Representation
We discussed in sections 1 and 2 why modern-day neural networks (aka deep neural networks) cannot learn universally valid templates such as the ones discussed above. The technical reason behind this impossibility is quite simple: we cannot learn these templates differently, and so for all of us to learn them from observation with zero error (as these facts demand), we would need to process an infinite number of examples, which is of course impossible. But besides this technical difficulty, the logic of these universally valid templates requires the use of type hierarchies and symbolic structures, neither of which is admitted by the statistical and associative nature of neural networks. To illustrate, let us reconsider the container template discussed above. The structure of this template is an object x and a container object y, and one of its basic inferential characteristics is that if x is contained in some object y, then x will always assume the location of y, or location(x) = location(y). But what types of objects x and y does this apply to? Do children learn this template separately for different types of objects (say x = ball, y = bag; x = phone, y = briefcase; …), or do we have a typed (and polymorphic) version of this template, such as x : artifact, y : container? All evidence from cognitive science points to having a type hierarchy, for efficiency of information storage as well as efficiency at reasoning time. But this means admitting symbolic structures and symbolic reasoning, neither of which is admitted in ML/DL architectures.
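The point about typing can also be sketched in code (again a hypothetical illustration; the class names are mine): with a type hierarchy, the rule location(x) = location(y) is written once, over the types artifact and container, and automatically covers every (ball, bag), (phone, briefcase), … pairing.

```python
# A toy type hierarchy, in the spirit of x : artifact, y : container.
class Artifact: pass
class Ball(Artifact): pass
class Phone(Artifact): pass
class Container(Artifact): pass
class Bag(Container): pass
class Briefcase(Container): pass

def location_of_contained(x, y, location):
    # One polymorphic rule, stated at the level of the types,
    # not learned separately for each object pair.
    if not (isinstance(x, Artifact) and isinstance(y, Container)):
        raise TypeError("the container template does not apply")
    return location[y]  # location(x) = location(y)

location = {}
bag, briefcase = Bag(), Briefcase()
location[bag], location[briefcase] = "room", "office"

# The same rule works, unmodified, for any subtype pairing:
print(location_of_contained(Ball(), bag, location))         # prints room
print(location_of_contained(Phone(), briefcase, location))  # prints office
```

The `isinstance` checks are doing exactly the work a type hierarchy does in symbolic systems: one stored rule, applicable at reasoning time to every object of the right type.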
5. Final Remarks
ML as it is practiced now, namely as a data-driven methodology that learns functions from labelled data (the supervised version) or from a reward function (reinforcement learning), is not concerned with learning most of the universally valid cognitive templates that an intelligent agent would need to function in dynamic and uncertain environments. Failure to recognize this gap is, in our opinion, what hinders real progress in AI (e.g., in autonomous vehicles, which are in essence autonomous agents). Even in the field of natural language processing, where some progress has been made, there is a long tail that large language models (LLMs) will not be able to capture, a long tail that will hinder employing this technology in applications where extracting factual information, or making the right inferences implied in text, is of paramount importance (see Saba (2023)).
Before we can speak of intelligent agents that can think and can autonomously function in the world we live in, we must find a way to incorporate the considerable naïve physics (and commonsense psychology) that seems to be encoded in our DNA. That kind of knowledge, like the universal truths we find in Platonic Forms or the universal truths of mathematics (Grice et al., 2023), is universal knowledge that we have incorporated into our genetic code in line with the laws of nature. This type of knowledge is not (and cannot be) learned from data, for we do not have access to an infinite number of examples, nor do we have an infinite amount of time.
I mentioned at the beginning of this article that many newcomers to AI now equate ML with AI, while I have argued here that ML, as it is practiced today, is not at all the right path to AI. The schism between these two points of view is substantial, to say the least, which means one of them is clearly wrong. I therefore wholeheartedly welcome any feedback that points out where in this article I have committed a scientific sin.
Adams, M. P. (2009), "Empirical Evidence and the Knowledge-That/Knowledge-How Distinction", Synthese, 170(1): 97–114. doi:10.1007/s11229-008-9349-z
Grice, M., Kemp, S., Morton, N. J. and Grace, R. C. (2023), "The Psychological Scaffolding of Arithmetic", Psychological Review, June 26, 2023.
Lakoff, G. (1990), Women, Fire and Dangerous Things: What Categories Reveal About the Mind, University of Chicago Press.
Roland, J. (1958), "On 'Knowing How' and 'Knowing That'", The Philosophical Review, 67(3): 379–388.
Saba, W. (2023), "Linguistic Phenomena that are Beyond Large Language Models" (forthcoming).
Walid Saba is Senior Research Scientist at the Institute for Experiential AI at Northeastern University. He has published over 45 articles on AI and NLP, including an award-winning paper at KI-2008.