Chapter 4

Concerning Bayesianism
Expert opinion
More on Laplace's rule
On Markov chains
A note on complexity
Footnotes
https://manyworlds784.blogspot.com/p/footnotes.html

Concerning Bayesianism
The purpose of this paper is not to rehash the many convolutions of Bayesian controversies, but rather to spotlight a few issues that may cause the reader to re-evaluate her conception of a "probabilistic universe." (The topic will recur beyond this section.)

"Bayesianism" is a term that has come to cover a lot of ground. Bayesian statistical methods these days employ strong computational power to achieve results barely dreamt of in the pre-cyber era.

However, two concepts run through the heart of Bayesianism: Bayes's formula for conditional probability and the principle of insufficient reason or some equivalent. Arguments concern whether "reasonable" initial probabilities are a good basis for calculation and whether expert opinion is a valid basis for an initial probability. Other arguments concern whether we are only measuring a mental state or whether probabilities have some inherent physical basis external to the mind. Further, there has been disagreement over whether Bayesian statistical inference for testing hypotheses is well-grounded in logic and whether the calculated results are meaningful.

The clash is important because Bayesian methods tend to be employed by economists and epidemiologists and so affect broad government policies.

"The personal element is recognized by all statisticians," observes David Howie. "For Bayesians, it is declared on the choice of prior probabilities; for Fisherians in the construction of statistical model; for the Neyman-Pearson school in the selection of competing hypotheses. The social science texts, however, portrayed statistics as a purely impersonal and objective method for the design of experiments and the representation of knowledge" (24).

Similarly, Gerd Gigerenzer argues that a conspiracy by those who control social science journals has brought about the "illusion of a mechanized inference process." Statistics textbooks for social science students have, under publisher pressure, tended to omit or play down not only the personality conflicts among pioneering statisticians but also the differences in reasoning, Gigerenzer says. Such textbooks presented a hybrid of the methods of R.A. Fisher and of Jerzy Neyman and Egon Pearson, without alerting students as to the schisms among the trailblazers, or even, in most cases, mentioning their names. The result, says Gigerenzer, is that: "Statistics is treated as abstract truth, the monolithic logic of inductive inference."

Gigerenzer on 'mindless statistics'
http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf

In the last decade, a chapter on Bayesian methods has become de rigueur for statistics texts. However, it remains true that students are given the impression that statistical inferences are pretty much cut and dried, though authors often do stress the importance of asking the right questions when setting up a method of attack on a problem.

A useful explanation of modern Bayesian reasoning is given by Michael Lavine:

What is Bayesian statistics and why everything else is wrong
http://www.math.umass.edu/~lavine/whatisbayes.pdf

The German tank problem gives an interesting example of a Bayesian analysis.

The German tank problem
http://en.wikipedia.org/wiki/German_tank_problem

In the late 19th century, Charles S. Peirce denounced the Bayesian view and argued that frequency ratios are the proper basis of scientific probability.

C.S. Peirce on probability
http://plato.stanford.edu/entries/peirce/#prob

This view, as espoused by Von Mises, was later carried forward by Popper (25), who eventually replaced it with his propensity theory (26), which is also anti-Bayesian in character.

Expert opinion
One justification of Bayesian methods is the use of a "reasonable" initial probability arrived at by the opinion of an expert or experts. Nate Silver points out, for example, that scouts did better at predicting who would become a strong ballplayer than did his strictly statistical method, prompting him to advocate combining subjective expert opinion with standard methods of statistical inference.

"If prospect A is hitting .300 with twenty home runs and works at a soup kitchen during his off days, and prospect B is hitting .300 with twenty home runs but hits up night clubs during his free time, there is probably no way to quantify this distribution," Silver writes. "But you'd sure as hell want to take it into account."

Silver notes that the arithmetic mean of several experts tends to yield more accurate predictions than the predictions of any single expert (27).

Obviously, quantification of expert opinion is merely a convenience. Such an expert is essentially using probability inequalities, as in p(x) < p(y) < p(z) or p(x) < [1 - p(x)].

Sometimes when I go to a doctor, the nurse asks me to rate pain on a scale of 1 to 10. I am the expert, and yet I have difficulty with this question most of the time. But if I am shown a set of stick figure faces, with various expressions, I can always find the one that suits my subjective feeling. Though we are not specifically talking of probabilities, we are talking about the information inherent in inequalities and how that information need not always be quantified.

Similarly, I suggest that experts do not use fine-grained degrees of confidence, but generally stick with a simple ranking system, such as {1/100, 1/4, 1/3, 1/2, 2/3, 3/4, 99/100}. It is important to realize that a ranking system can be mapped onto a circle, thus giving a system of pseudo-percentages. This is the custom. But the numbers, not representing frequencies, cannot be said to represent percentages. An exception is the case of an expert who has a strong feel for the frequencies and uses her opinion as an adequate approximation of some actual frequency.

Often, what Bayesians do is to use an expert opinion for the initial probability and then apply the Bayesian formula to come up with frequency probabilities. Some of course argue that if we plug in a pseudo-frequency and use the Bayesian formula (including some integral forms) for an output, then all one has is a pseudo-frequency masquerading as a frequency. However, it is possible to think about this situation differently. One hazards a guess as to the initial frequency -- perhaps based on expert opinion -- and then looks at whether the output frequency ratio is reasonable. That is, a Bayesian might argue that he is testing various initial values to see which yields an output that accords with observed facts.

One needn't always use the Bayesian formula to use this sort of reasoning.

Consider the probability of the word "transpire" in an example of what some would take as Bayesian reasoning. I am fairly sure it is possible, with much labor, to come up with empirical frequencies of that word that could be easily applied. But, from experience, I feel very confident in saying that far fewer than 1 in 10 books of the type I ordinarily read have contained that word. I also feel confident that a typical reader of books will agree with that assessment. So in that case, it is perfectly reasonable to plug in the value 0.1 when doing a combinatorial probability calculation for a recently read string of books. If, of the last 15 books I have read, 10 have contained the word "transpire," we have 15C10 x (1/10)^10 x (9/10)^5 ≈ 1.77 x 10^(-7). That is, the probability of such a string of books occurring by chance is far less than 1 in a million.
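
As a check on the arithmetic, here is a minimal sketch of that binomial calculation in Python (the 0.1 prior and the 10-of-15 count are simply the rough figures assumed above):

from math import comb

p = 0.1          # rough prior: chance a given book contains "transpire"
n, k = 15, 10    # 10 of the last 15 books contained the word

# Binomial point probability: C(15,10) * p^10 * (1-p)^5
prob = comb(n, k) * p**k * (1 - p)**(n - k)
print(f"{prob:.3e}")   # about 1.77e-07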

This sort of "Bayesian" inference is especially useful when we wish to establish an upper bound of probability, which, as in the "transpire" case, may be all we need.

One may also argue for a "weight of evidence" model, which may or may not incorporate Bayes's theorem. Basically, the underlying idea is that new knowledge affects the probability of some outcome. Of course, this holds only if the knowledge is relevant, which requires "reasonableness" in specific cases, where a great deal of background information is necessary. But this doesn't mean that the investigator's experience won't be a reliable means of evaluating the information and arriving at a new probability, arguments of Fisher, Popper and others notwithstanding.

A "weight of evidence" approach of course is nothing but induction, and requires quite a bit of "subjective" expert opinion.

On this point, Keynes, in his Treatise, writes: "Take, for instance, the intricate network of arguments upon which the conclusions of The Origin of Species are founded: How impossible it would be to transform them into a shape in which they would be seen to rest upon statistical frequencies!" (28)

Mendelism and the statistical population genetics pioneered by J.B.S. Haldane, Sewall Wright and Fisher were still in the early stages when Keynes wrote this. And yet, Keynes's point is well taken. The expert opinion of Darwin the biologist was on the whole amply justified (29) once frequency-based methods built on discrete alleles became available (superseding much of the work of Francis Galton).

Three pioneers of the 'modern synthesis'
http://evolution.berkeley.edu/evolibrary/article/history_19

About Francis Galton
http://www.psych.utah.edu/gordon/Classes/Psy4905Docs/PsychHistory/Cards/Galton.html

Keynes observes that Darwin lacked statistical and mathematical training and that, in fact, a better use of frequencies would have helped him. Even so, Darwin did use frequencies informally. In fact, he was using his expert opinion as a student of biology to arrive at frequencies -- though not numerical ones, but rather rule-of-thumb inequalities of the type familiar to non-mathematical persons. From this empirico-inductive method, Darwin established various propositions, to which he gave informal credibility rankings. From these, he proceeded by standard logical implication, but again informally.

One must agree here with Popper's insight that the key idea comes first: Darwin's notion of natural selection was based on the template of artificial selection for traits in domestic animals, although he did not divine the driving force -- eventually dubbed "survival of the fittest" -- behind natural selection until coming across a 1798 essay by Thomas Malthus.

Essay on the Principle of Population
http://www.ucmp.berkeley.edu/history/malthus.html

Keynes argues that the frequency of some observation and its probability should not be considered to be identical. (This led Carnap to define two forms of probability, though unlike Keynes, he was only interested in frequentist probability.) One may well agree that a frequency gives a number. Yet there must be some way of connecting it to degrees of belief that one ought to have. On the other hand, who actually has a degree of belief of 0.03791? Such a number is only valuable if it helps the observer to discriminate among inequalities, as in p(y) << p(x) < p(z).

One further point: The ordinary human mind-body system usually learns through an empirico-inductive frequency-payoff method, as I describe in Toward. So it makes sense that a true expert would have assimilated much knowledge into her autonomic systems, analogous to algorithms used in computing pattern detection and "auto-complete" systems. Hence one might argue that, at least in some cases, there is strong reason to view the "subjective" opinion as a good measuring rod. Of course, then we must ask, How reliable is the expert? And it would seem a frequency analysis of her predictions would be the way to go.

Studies of polygraph and fingerprint examiners have shown that in neither of those fields does there seem to be much in the way of corroboration that these forensic tools have any scientific value. At the very least, such studies show that the abilities of experts vary widely (30). This is an appropriate place to bring up the matter of the "prosecutor's fallacy," which I describe here:

The prosecutor's fallacy
http://kryptograff.blogspot.com/2007/07/probability-and-prosecutor-there-are.html

Here we run into the issue of false positives. A test can be 99 percent accurate, and yet the probability that a particular positive result is a true match can be very low. Take an example given by mathematician John Allen Paulos. Suppose a terrorist profile program is 99 percent accurate and let's say that 1 in a million Americans is a terrorist. That makes 300 terrorists, of whom the program would be expected to catch 297. However, the program also has an error rate of 1 percent, and 1 percent of 300 million Americans is 3 million people. So a data-mining operation would turn up some 3 million "suspects" who fit the terrorist profile but are innocent nonetheless. The probability that a positive result identifies a real terrorist is therefore about 297 divided by 3 million, or roughly one in 10,000 -- a very low likelihood.
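
A quick sketch of that base-rate arithmetic, using the detection and false-positive rates assumed in Paulos's example:

population = 300_000_000
terrorist_rate = 1e-6            # 1 in a million
sensitivity = 0.99               # fraction of real terrorists flagged
false_positive_rate = 0.01       # fraction of innocents flagged

terrorists = population * terrorist_rate                             # 300
true_positives = terrorists * sensitivity                            # 297
false_positives = (population - terrorists) * false_positive_rate    # ~3 million

p_real = true_positives / (true_positives + false_positives)
print(f"P(terrorist | flagged) = {p_real:.6f}")   # about 1 in 10,000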

But data mining isn't the only issue. Consider biometric markers, such as a set of facial features, fingerprints or DNA patterns. The same rule applies. It may be that if a person was involved in a specific crime or other event, the biometric "print" will finger him or her with 99 percent accuracy. Yet context is all important. If that's all the cops have got, it isn't much. Without other information, the odds are still tens of thousands to one that the police or the Border Patrol have the wrong person.

The practicality of so-called Bayesian reasoning was illustrated by Enrico Fermi, who would ask his students to estimate how many piano tuners were working in Chicago. Certainly, one should be able to come up with plausible ballpark estimates based on subjective knowledge.

Conant on Enrico Fermi and a 9/11 plausibility test
http://znewz1.blogspot.com/2006/11/enrico-fermi-and-911-plausibility-test.html

I have also used the Poisson distribution for a Bayesian-style approach to the probability that wrongful executions have occurred in the United States.

Fatal flaws
http://znewz1.blogspot.com/2007/06/fatal-flaws.html

Some of my assumptions in those discussions are open to debate, of course.

More on Laplace's rule
The physicist Harold Jeffreys agrees with Keynes that the rule of succession isn't plausible without modification, that is, via some initial probability. In fact, the Laplacian result of (m+1)/(m+2) after m straight successes gives, after a single success, a probability of 2/3 that the next trial will succeed -- which, for some experimental situations, Jeffreys regards as too low, rather than too high!

I find it interesting that economist Jevons's use of the Laplacian formula echoes the doomsday argument of Gott. Jevons observed that "if we suppose the sun to have risen demonstratively" one billion times, the probability that it will rise again, on the ground of this knowledge merely, is

(10^9 + 1) / (10^9 + 2)

However, notes Jevons, the probability that it will rise a billion years hence is

(10^9 + 1) / (2 x 10^9 + 2)

or very close to 1/2.
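
For concreteness, here is a small sketch of the rule of succession applied to Jevons's figures (the choice of 10^9 trials is simply his illustrative number):

from fractions import Fraction

def rule_of_succession(successes, trials):
    """Laplace's rule: probability the next trial succeeds, given the record so far."""
    return Fraction(successes + 1, trials + 2)

m = 10**9                                   # a billion uninterrupted sunrises
print(float(rule_of_succession(m, m)))      # next sunrise: ~0.999999999
print(float(Fraction(m + 1, 2 * m + 2)))    # Jevons's long-run figure: ~1/2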

Though one might agree with Jevons that this formula is a logical outcome of the empirico-inductivist method in science, it is the logic of a system taken to an extreme where, I suggest, it loses value. That is, the magnification of our measuring device is too big. A question of that sort is outside the scope of the tool. Of course, Jevons and his peers knew nothing of the paradoxes of Cantor and Russell, or of Goedel's remarkable results. But if the tool of probability theory -- whichever theory we're talking about -- is of doubtful value in the extreme cases, then a red flag should go up not only cautioning us that beyond certain boundaries "there be dragons," but also warning us that the foundations of existence may not really be explicable in terms of so-called blind chance.

In fact, Jevons does echo Keynes's view that extreme cases yield worthless quantifications, saying: "Inferences pushed far beyond their data soon lose a considerable probability." Yet, we should note that the whole idea of the Laplacian rule is to arrive at probabilities when there is very little data available. I suggest that not only Jevons, but Keynes and other probability theorists, might have benefited from more awareness of set theory. That is, we have sets of primitive observations that are built up in the formation of the human social mind and from there, culture and science build sets of relations from these primitive sets.

So here we see the need to discriminate between a predictive algorithm based upon higher sets of relations (propensities of systems), versus a predictive algorithm that emulates the human mind's process of assessing predictability based on repetition, at first with close to zero system information (the newborn). And a third scenario is the use of probabilistic assessment in imperfectly predictive higher-level algorithms.

"We ought to always be applying the inverse method of probabilities so as to take into account all additional information," argues Jevons. This may or may not be true. If a system's propensities are very well established, it may be that variations from the mean should be regarded as observational errors and not indicative of a system malfunction.

"Events when closely scrutinized will hardly ever prove to be quite independent, and the slightest preponderance one way or the other is some evidence of connexion, and in the absence of better evidence should be taken into account," Jevons says (31).

First of all, two events of the same type are often beyond close scrutiny. But, what I think Jevons is really driving at is that when little is known about a dynamical system, the updating of probabilities with new information is a means of arriving at the system's propensities (biases). In other words, we have a rough method of assigning a preliminary information value to that system (we are forming a set of primitives), which can be used as a stopgap until such time as a predictive algorithm based on higher sets is agreed upon, even if that algorithm also requires probabilities for predictive power. Presumably, the predictive power is superior because the propensities have now been well established.

So we can say that the inverse method, and the rule of succession, is in essence a mathematical systemization of an intuitive process, though a more finely gauged one. By extension, much of the "scientific method" follows such a process, where the job of investigators is to progressively screen out "mere correlation" as well as to improve system predictability.

That is, a set based on primitive observations is "mere correlation" and so, as Pearson argues, the edifice of science is built upon correlation, not cause. As Pearson points out, the notion of cause is very slippery, which is why he prefers the concept of correlation (32). However, he also had very little engagement with set theory. I would say that what we often carelessly regard as "causes" are to be found in the mathematics of sets.

~(A ∩ B) may be thought of as the cause of ~A ∪ ~B.

Of course, I have left out the time elements, as I only am giving a simple example. What I mean is that sometimes the relations among higher-order sets correspond to "laws" and "causes."
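
The set identity behind that simple example is De Morgan's law; a quick finite check, with an arbitrary toy universe, illustrates it:

U = set(range(10))       # an arbitrary toy universe
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

lhs = U - (A & B)        # ~(A ∩ B)
rhs = (U - A) | (U - B)  # ~A ∪ ~B
print(lhs == rhs)        # True: the two sides coincide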

On Markov chains
Conditional probability of course takes on various forms when it is applied. Consider a Markov chain, which is regarded as far more "legitimate" than Laplace's rule.

Grinstead and Snell gives this example: The Land of Oz is a fine place but the weather isn't very good. Ozmonians never have two nice days in a row. "If they have a nice day, they are just as likely to have snow as rain the next day. If they have snow or rain, they have an even chance of having the same the next day. If there is change from snow or rain, only half of the time is this a change to a nice day."

With this information, a Markov chain can be obtained and a matrix of "transition probabilities" written.

Grinstead and Snell gives this theorem: Let P be the transition matrix of a Markov chain, and let u be the probability vector which represents the starting distribution. Then the probability that the chain is in state S_i after n steps is the ith entry in the vector u^(n) = uP^n (33).

Grinstead and Snell chapter on Markov chains
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf

Wolfram MathWorld on Markov chain
http://mathworld.wolfram.com/MarkovChain.html
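
To make the theorem concrete, here is a minimal sketch of the Land of Oz chain with states ordered (rain, nice, snow); the transition matrix follows the weather rules quoted above:

import numpy as np

P = np.array([
    [0.50, 0.25, 0.25],   # after rain
    [0.50, 0.00, 0.50],   # after a nice day
    [0.25, 0.25, 0.50],   # after snow
])

u = np.array([0.0, 1.0, 0.0])   # suppose we start on a nice day

# u^(n) = u P^n gives the distribution after n steps
for n in (1, 2, 6):
    print(n, u @ np.linalg.matrix_power(P, n))
# The distribution settles toward the chain's stationary vector, (0.4, 0.2, 0.4).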

At least with a Markov process, the idea is to deploy non-zero propensity information, which is determined at some specified state of the system. Nevertheless, there is a question here as to what type of randomness is applicable. Where does one draw the line between subjective and objective in such a case? That depends on one's reality superstructure, as discussed later.

At any rate, it seems fair to say that what Bayesian algorithms, such as the rule of succession, tend to do is to justify via quantification our predisposition to "believe in" an event after multiple occurrences, a Darwinian trait we share with other mammals. Still, it should be understood that one is asserting one's psychological process in a way that "seems reasonable" but is at root faith-based and may be in error. More knowledge of physics may increase or decrease one's confidence, but intuitive assumptions remain faith-based.

It can be shown via logical methods that, as n rises, the opportunities for a Goldbach pair, in which n is summable by two primes, rise by approximately n^2. So one might argue that the higher an arbitrary n, the less likely we are to find a counterexample. And computer checks verify this point.
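
In that spirit, here is a small sketch that counts the actual Goldbach representations of a few even numbers (a crude proxy for the "opportunities" mentioned above):

def primes_up_to(n):
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n**0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return sieve

def goldbach_count(n, sieve):
    """Number of unordered prime pairs (p, q) with p + q = n."""
    return sum(1 for p in range(2, n // 2 + 1) if sieve[p] and sieve[n - p])

sieve = primes_up_to(100_000)
for n in (100, 1_000, 10_000, 100_000):
    print(n, goldbach_count(n, sieve))
# The counts grow with n, which is why a counterexample looks ever less likely.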

Or one can use Laplace's rule of succession to show that the probability that the proposition holds for n is given by (n+1)/(n+2). In both cases, at infinity, we have probability 1, or "virtual certainty," that Goldbach's conjecture is true, and yet it might not be, unless we mean that the proposition is practically true because it is assumed that an exception occurs only occasionally. And yet, there remains the possibility that above some n, the behavior of the primes changes (there being so few). So we must beware the idea that such probabilities are even meaningful over infinity.

At any rate, the confidence of mathematicians that the conjecture is true doesn't necessarily rise as n is pushed by ever more powerful computing. That's because no one has shown why no counterexample can occur. Now, one is entitled to act as though the conjecture is true. For example, one might include it in some practical software program.

A scientific method in the case of attacking public key cryptography is to use independent probabilities concerning primes as a way of escaping a great deal of factorization. One acts as though certain factorization conjectures are true, and that cuts the work involved. When such tests are applied several times, the probability of insufficient factorization drops considerably, meaning that a percentage of "uncrackable" factorizations will fall to this method.

As Keynes shrewdly observed, a superstition may well be the result of the empirical method of assigning a non-numerical probability based on some correlations. For example, when iron plows were first introduced into Poland, that development was followed by a succession of bad harvests, whereupon many farmers revived the use of wooden plowshares. In other words, they acted on the basis of a hypothesis that at the time seemed reasonable.

They also had a different theory of cause and effect than do we today, though even today correlation is frequently taken for causation. This follows from the mammalian psychosoma program that adopts the "survival oriented" theory that when an event often brings a positive or negative feeling, that event is the cause of the mammal's feeling of well-being.

Keynes notes that the "common sense" belief in the "real existence of other people" may require an a priori assumption, an assumption that I would say implies the existence of a cognized, if denied, noumenal world. So the empirical, or inductive, claim that the real existence of a human being is "well established" is, we might say, circular.

Unlike many writers on the philosophy of science, Popper (34) rejected induction as a method of science. "And although I believe that in the history of science it is always the theory and not the experiment, always the idea and not the observation, which opens up the way to new knowledge, I also believe that it is the experiment which saves us from following a track that leads nowhere, which helps us out of the rut, and which challenges us to find a new way."

(Popper makes a good point that there are "diminishing returns of learning by induction," because lim [m, n --> ∞] (m/n) = 1. That is, as more evidence piles up, the value of each additional confirmation decreases.)

A note on complexity
As it is to me inconceivable that a probabilistic scenario doesn't involve some dynamic system, it is evident that we construct a theory -- which in some disciplines is a mathematically based algorithm or set of algorithms for making predictions. The system with which we are working has initial value information and algorithmic program information. This information is non-zero and tends to yield propensities, or initial biases. However, the assumptions or primitive notions in the theory either derive from a subsidiary formalism or are found by empirical means; these primitives derive from experiential -- and hence unprovable -- frequency ratios.

I prefer to view the simplicity of a theory as a "small" statement (which may be nested inside a much larger statement). From the algorithmic perspective, we might say that the number of parameters is equivalent to the number of input values, or, better, that the simplicity corresponds to the information in the algorithm design and input. Simplicity and complexity may be regarded as two ends of some spectrum of binary string lengths.

Another way to view complexity is similar to the Chaitin algorithmic information ratio, but distinct. In this case, we look at the Shannon redundancy versus the Shannon total information.

So the complexity of a signal -- which could be the mathematical representation of a physical system -- would then not be found in the maximum information entailed by equiprobability of every symbol. The structure in the mathematical representation implies constraints -- or conditional probabilities for symbols. So then maximum structure is found when symbol A strictly implies symbol B in a binary system, which is tantamount to saying A = B, giving the uninteresting string: AA...A.

Maximum structure then violates our intuitive idea of complexity. So what do we mean by complexity in this sense?

A point that arises in such discussions concerns entropy (the tendency toward decrease of order) and the related idea of information, which is sometimes thought of as the surprisal value of a digit string. Sometimes a pattern such as AA...A is considered to have low information because we can easily calculate the nth value (assuming we are using some algorithm to obtain the string). So the Chaitin-Kolmogorov complexity is low, or that is, the information is low. On the other hand a string that by some measure is effectively random is considered here to be highly informative because the observer has almost no chance of knowing the string in detail in advance.

Leon Brillouin in Science and Information Theory gives a thorough and penetrating discussion of information theory and physical entropy. Physical entropy he regards as a special case under the heading of information theory (32aa).

Shannon's idea of maximum entropy for a bit string means that it has no redundancy, and so potentially carries the maximum amount of new information. This concept oddly ties together maximally random with maximally informative. It might help to think of the bit string as a carrier of information. Yet, because we screen out the consumer, there is no practical difference between the "actual" information value and the "potential" information value, which is why no one bothers with the "carrier" concept.

However, we can also take the opposite tack. Using runs testing, most digit strings (multi-value strings can often be transformed, for test purposes, to bi-value strings) are found under the bulge in the runs test bell curve and represent probable randomness. So it is unsurprising to encounter such a string. It is far more surprising to come across a string with far "too few" or far "too many" runs. These highly ordered strings would then, from this perspective, be considered to have high information value because they are possibly indicative of a non-random organizing principle.

This distinction may help address Stephen Wolfram's attempt to cope with "highly complex" automata (32a). By these, he means those with irregular, random-like structures running through periodic "backgrounds" (sometimes called "ether"). If a sufficiently long runs test were done on such automata, we would obtain, I suggest, z scores in the high but not outlandish range. The z score would give a gauge of complexity.
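
For reference, here is a minimal sketch of such a runs-test z score (the Wald-Wolfowitz statistic) applied to a few illustrative bi-valued strings:

import math
import random

def runs_z_score(bits):
    """Wald-Wolfowitz runs test: z score for the number of runs in a bi-valued sequence."""
    n1 = sum(bits)
    n2 = len(bits) - n1
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    return (runs - mu) / math.sqrt(var)

random.seed(0)
coin = [random.randint(0, 1) for _ in range(1000)]   # random-like: |z| small
blocky = [0] * 500 + [1] * 500                       # far too few runs: z strongly negative
alternating = [i % 2 for i in range(1000)]           # far too many runs: z strongly positive
for s in (coin, blocky, alternating):
    print(round(runs_z_score(s), 2))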

We might distinguish complicatedness from complexity by saying that a random-like permutation of our grammatical symbols is merely complicated, but a grammatical permutation, taking into account transmission error, is complex.

In this respect, we might also construe complexity as a measure of coding efficiency.

So we know that "complexity" is a worthwhile concept, to be distinguished -- at times -- from "complicatedness." We would say that something that is maximally complicated has the quality of binomial "randomness;" it resides with the largest sets of combinations found in the 68% zone.

I suggest that we may as well define maximally complex to mean a constraint set yielding 50% redundancy in Shannon information. That is, I' = I - Ic, where I' is the new information, I the maximum information that occurs when all symbols are equiprobable (zero structural or propensity information), and Ic the information already fixed by the structural constraints.
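
A rough sketch of the redundancy idea, comparing a string's observed per-symbol Shannon entropy with the equiprobable maximum (note that this zero-order measure ignores symbol-to-symbol conditional structure):

from collections import Counter
from math import log2

def entropy_and_redundancy(s):
    counts = Counter(s)
    n = len(s)
    h = -sum((c / n) * log2(c / n) for c in counts.values())   # observed entropy, bits/symbol
    h_max = log2(len(counts)) if len(counts) > 1 else 0.0      # equiprobable maximum
    redundancy = 1 - h / h_max if h_max > 0 else 1.0
    return h, h_max, redundancy

for s in ("AAAAAAAAAAAB", "AABABBABAABB"):
    print(s, entropy_and_redundancy(s))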

Consider two specific primes that are multiplied to form a composite. The names of the primes, together with the multiplication algorithm, may be given an advance information value Ic. Alice, who is doing the computation, has this "information," but doesn't know what the data stream will look like until the composite is computed. But she would be able to estimate the stream's approximate length and might know that certain substrings are very likely, or certain. That is, she has enough advance information to devise conditional probabilities for the characters.

Bob encounters the data string and wishes to decipher it. He lacks part of Ic: the names of the primes. So there is more information in the string for him than for Alice. He learns more once he deciphers it than does she, who needn't decipher.

In this respect we see that for him the characters are closer to equiprobability, or maximum Shannon entropy, than they are for Alice. For him, the amount of information is strongly correlated with the algorithmic work involved. His surest course -- if he wants to be certain of obtaining the primes -- is to test the primes up to the composite's square root, found with the sieve of Eratosthenes. This is considered a "hard" computing problem, as the work increases exponentially with the number of digits of the composite.

On the other hand, if Alice wants to compute p_n x p_m, her work grows only polynomially in the number of digits.
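
A toy illustration of that work asymmetry, using two modest primes (chosen arbitrarily for the example) and plain trial division in place of a full sieve:

import time

def smallest_factor(n):
    """Recover the smallest prime factor of n by trial division up to sqrt(n)."""
    if n % 2 == 0:
        return 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d
        d += 2
    return n   # n itself is prime

p, q = 1_000_003, 1_000_033
assert smallest_factor(p) == p and smallest_factor(q) == q   # both are prime

t0 = time.perf_counter()
composite = p * q                        # Alice's work: one multiplication
t1 = time.perf_counter()
recovered = smallest_factor(composite)   # Bob's work: a search up to sqrt(composite)
t2 = time.perf_counter()

print(recovered, f"multiply: {t1 - t0:.2e} s", f"factor: {t2 - t1:.2e} s")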

A string with maximum Shannon entropy means that the work of decipherment is very close to k^n, where k is the base of the number system and n the string length.

We see then that algorithmic information and standard Shannon information are closely related by the concept of computing work.

Another way to view complexity is via autocorrelation. So an autocorrelation coefficient near 1 or -1 can be construed to imply high "order." As Wikipedia notes, autocorrelation, also known as serial correlation, is the cross-correlation of a signal with itself. Informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.

Multidimensional autocorrelation can also be used as a gauge of complexity. However, it would seem that any multidimensional signal could be mapped onto a two-dimensional signal graph. (I concede I should look into this further at some point.) But, we see that the correlation coefficient, whether auto or no, handles randomness in a way that is closely related to the normal curve. Hence, the correlation coefficient for something highly complex would fall somewhere near 1 or -1, but not too close, because, in general, extreme order is rather uncomplicated.

One can see that the autocorrelation coefficient is a reflection of Shannon's redundancy quantity. (I daresay there is an expression equating or nearly equating the two.)

When checking the randomness of a signal, the autocorrelation lag time is usually put at 1, according to the National Institute of Standards and Technology, which relates the following:

Given measurements Y_1, Y_2, ..., Y_N at times X_1, X_2, ..., X_N, the lag-k autocorrelation function is defined as

r_k = [ Σ_{i=1}^{N-k} (Y_i - Y')(Y_{i+k} - Y') ] / [ Σ_{i=1}^{N} (Y_i - Y')^2 ]

with Y' representing the mean of the Y values.

Although the time variable, X, is not used in the formula for autocorrelation, the assumption is that the observations are equi-spaced.

Wikipedia article on autocorrelation
http://en.wikipedia.org/wiki/Autocorrelation

NIST article on autocorrelation
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35c.htm
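
A minimal sketch of that lag-k formula, applied to a noise series and to a periodic signal (both generated here only for illustration):

import numpy as np

def autocorr(y, k=1):
    """Lag-k autocorrelation r_k per the NIST formula above."""
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    num = np.sum((y[:len(y) - k] - ybar) * (y[k:] - ybar))
    den = np.sum((y - ybar) ** 2)
    return num / den

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)                      # random-like: r_1 near 0
wave = np.sin(np.linspace(0, 20 * np.pi, 1000))    # periodic: r_1 near 1
print(round(autocorr(noise, 1), 3), round(autocorr(wave, 1), 3))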

In another vein, consider the Cartesian product A X B of phenomena in A related to B, such that a is a member of A and b is a member of B, aRb means a followed by b, and the equivalence relation applies, such that the relation is reflexive, symmetric and transitive.

One algorithm may obtain a smaller subset of A X B than does another. The superior algorithm fetches the larger subset, with the caveat that an "inferior" algorithm may be preferred because its degree of informational complexity is lower than that of the "superior" algorithm.

One might say that algorithm X has more "explanatory power" than algorithm Y if X obtains a larger subset of A X B than does Y and, depending on one's inclination, if X also entails "substantially less" work than does Y.

The method of science works much like the technique of building up a logic proof via several approximations. Insight can occur once an approximation is completed and the learner is then prepared for the next approximation or the final proof.

This is analogous to deciphering a lengthy message. One may have hard information, or be required to speculate, about a part of the cipher. One then progresses -- hopefully -- as the new information helps unravel the next stage. That is, the information in the structure (or, to use Shannon's term, in the redundancy) is crucial to the decipherment. Which is to say that a Bayesian style of thinking is operative. New information alters probabilities assigned certain substrings.

Decipherment of a coded or noisy message is a pretty good way of illustrating why a theory might be considered valid. Once part of the "message" has been analyzed as having a fair probability of meaning X, the scientist ("decoder") uses that provisional information, along with any external information at hand, to make progress in reading the message. Once a nearly complete message/theory is revealed, the scientist/decoder and her associates believe they have cracked the "code" based on the internal consistency of their findings (the message).

In the case of science in general, however, no one knows how long the message is, or what would assuredly constitute "noise" in the signal (perhaps, a priori wrong ideas?). So the process is much fuzzier than the code cracker's task.

Interestingly, Alan Turing and his colleagues used Bayesian conditional probabilities as part of their decipherment program, establishing that such methods, whatever the logical objections, work quite well in some situations. However, though the code-cracking analogy is quite useful, it seems doubtful that one could use some general method of assigning probabilities -- whether of the Turing or Shannon variety -- to scientific theories, other than possibly to toy models.

Scientists usually prefer to abstain from metaphysics, but their method plainly raises the question: "If the universe is the signal, what is the transmitter?" or "Can the message transmit itself?" Another fair question is: "If the universe is the message, can part of the message (we humans) read the message fully?"

We have a problem of representation when we pose the question: "Can science find an algorithm that, in principle, simulates the entire universe?" The answer is that no Turing machine can model the entire universe.

Conant on Hilbert's sixth problem
http://kryptograff.blogspot.com/2007/06/on-hilberts-sixth-problem.html

Go to Chapter 5 HERE.
