Evaluating the paranormal experimentally
This is a brief discussion of ways to evaluate experimentally whether people are using psi (extra-sensory) channels to gain information. We will assume, for the purposes of this discussion, that all possible ways of cheating (by normal senses, electronics, etc) have been excluded by the experimental design (double blind, etc). Also, the question is put: do ghosts have anything to do with psi (ESP, PK, etc)?
Try our online Zener Card test here. It is a simple idea. You have to guess which one of five variants of 'card' (the non-online version comes as a deck of cards) will appear next. It is easy to see that you have a one in five chance of being correct by chance (at least in the online version - with a physical deck of cards it would depend on how many of each card type were left to be picked).
Zener cards are a traditional way of evaluating psi ability. If the cards are chosen randomly (by a computer or from a well shuffled pack) there is no normal way of telling which will come next (though you will get 1 in 5 right by pure chance). So, if you can consistently predict the right card, it implies you may be using a paranormal way (a psi channel) of knowing what it is.
Before you can claim to be psychic, you have to beat random chance. For almost any event imaginable, there is a chance, however small, that it will happen by random luck. It may sound impossible that you can meet someone you don't know in the street and tell them their name, birthday and place of birth, but it is not. There is a very small chance you can do it by simply plucking a name, date and place from your imagination. The odds against doing it are very high but not infinite.
If you get a circle in the Zener test, what are the odds that the next card will also be a circle? They are still 1 in 5 (assuming an infinite number of 'cards' is available, as in the online version). That's because the identity of the last card does not affect the next one. This makes it easy to calculate the odds of getting a series of cards right. That's because (a) there are only five variants of card and (b) the subject knows what each of them is.
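Because the draws are independent, the chance of a run of correct guesses is just 1 in 5 multiplied by itself once for each card in the run. A minimal sketch in Python (the function name is our own, for illustration):

```python
# Chance of correctly guessing a streak of n Zener 'cards' by pure luck,
# assuming each draw is independent with five equally likely designs.
def streak_odds(n, designs=5):
    return (1 / designs) ** n

print(round(streak_odds(1), 6))  # 0.2 -- 1 in 5
print(round(streak_odds(3), 6))  # 0.008 -- 1 in 125
```

So a run of three correct guesses happens by luck about once in 125 attempts - rare, but far from impossible.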
Point (b) may seem trivial but it isn't. If you guessed the next card was a picture of a sword then your chances of being right would be precisely zero, because there is no such card in a Zener deck. The importance of knowing all the possible targets will become more obvious later in this discussion.
Sample size and negative hits
It is a boring fact of statistics that the larger the sample size, the closer your overall score will come to the predicted chance odds. So, if you guess 10 Zener cards and get 3 right it might sound impressive but it isn't, because the sample (10) is very small, statistically speaking. If, however, you guess 100 cards and get 25 right that is impressive (the predicted number you will get correct by chance is 20). There are statistical equations you can use to work out the odds against getting a particular score, given the sample size and the odds, but we are not really concerned with the maths here.
Another important concept is 'negative hits'. If you guess 100 cards and get 15 right, it sounds bad because you should get 20 right even by pure chance. However, it is actually impressive because it implies that you may be able to predict the cards somehow and then 'deliberately' say the wrong one. It is counter-intuitive but very low scores are as impressive as high ones. It is the degree to which your score differs from chance expectation, either way, that is important.
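The 'degree of difference from chance, either way' can be made precise with the binomial distribution. The sketch below (our own illustration, not a formula from any particular parapsychology protocol) computes the probability of a score at least as far from the expected 20-in-100 as the one observed, counting both high and low extremes:

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact chance of exactly k hits in n guesses at chance rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def deviation_odds(k, n, p=0.2):
    """Chance of a score at least as far from the expected n*p as k,
    in either direction -- so 15/100 is treated just like 25/100."""
    d = abs(k - n * p)
    return sum(binom_pmf(i, n, p) for i in range(n + 1)
               if abs(i - n * p) >= d)

print(round(deviation_odds(25, 100), 3))  # same value as for 15 hits...
print(round(deviation_odds(15, 100), 3))  # ...because both are 5 from 20
```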
So, if someone consistently beats the odds with Zener cards in controlled conditions, it implies that they may be using psi. It has been found experimentally that subjects who score reasonably well at first often get poorer results over time. Some would argue that this shows that they have no psi ability - that the scores are trending towards chance levels as you would expect as the overall sample size increases (see above). The fact that they did well at first is, in this scenario, mere chance in itself.
However, others argue that the reason for the decline is that card guessing tests are boring. Worse, card guessing is far removed from the circumstances where psi is said to manifest in the real world (such as in psychic readings). So new, more complex, tests have been devised over the years to make them more like real life psychic events.
The circumstances where psi is said to act in real life are generally very complex, making them hard, if not impossible, to evaluate satisfactorily. They are 'complex' in that they involve many variables, some of which are difficult or impossible to quantify.
For example, if a psychic gives the correct first name of someone's relative in a reading, what are the odds of them doing that by pure chance? It is almost impossible to say because of the many factors involved, most of which will not be known precisely at the time. For instance, one factor would be how common the name is for the age group and population concerned. Another factor would be how often the psychic would think of that name if asked to imagine one at random. These factors are difficult to quantify or measure.
As it is difficult, or impossible, to quantify such factors, it is impossible to say what the odds are of the psychic getting someone's name right by random chance. We might be impressed by such a feat but we cannot really say how unlikely it is.
It is often claimed that psi operates at an emotional level. Information is passed, maybe via a psychic, using paranormal channels because it is emotionally important to the person receiving the message.
In an attempt to engage the emotions, picture guessing has been tried as an alternative to Zener cards. Pictures can be emotive and far more interesting than five simple designs.
The following is an examination of a fictional picture experiment. It is not intended to be a practical guide for such experiments, as will become obvious. Rather it is a 'thought experiment' to demonstrate certain principles and their problems. Zener type experiments, with fixed possible targets and odds, are called 'forced choice'. The type of experiment outlined below is called 'free response' because the subject can reply with a response they choose.
Instead of five cards, with each design known to the subject (or psychic), five different pictures might be chosen to form the 'target' for the experiment. One of the five pictures is then chosen at random and stared at by someone who tries to 'transmit' its contents to the subject telepathically.
This 'run' may be repeated several times to build a decent statistical sample size (with one picture selected from five possible ones in each run). The subject, in an isolated room, is asked to describe the picture in words and/or drawings. Then, for each run, a judge looks at all five original pictures to decide which most closely matches the description from the psychic. The judge is unaware of which of the five pictures is the actual (randomly chosen) target. Alternatively, the judge may rank all five photos in order, from the best match to the worst.
The chance of getting the right picture by chance is 1 in 5, because there were 5 pictures in the original 'pool' and the target was chosen randomly. So far, so good.
Suppose that, over a number of runs, the subject scores well over the odds of random chance. On the face of it, there is a case for saying psi may have occurred. Or is there?
What is a hit?
Consider the following photo (a snowy, rocky mountainside with bushes):
You will now be put in the role of the judge of such an experiment. The photo above is the target for a particular run. There are four other photos which were rejected in the random selection process. Here are some (invented) sample descriptions from subjects for this target. Are any or all a good match?
a) "green field, trees, hills behind, lots of blue, clouds, sunny"
b) "house with long drive, lined by trees, car, green bushes"
c) "yacht on the dark ocean, very white sky"
d) "two people playing tennis, houses in the background"
e) "bowl of fruit, mostly apples, white table cloth and ornaments"
f) "stones, plants, rockery, garden, trees"
Note how the descriptions ('free responses') differ in length and detail. Note also how, ignoring the actual target picture, you can get a reasonable idea of what the subject saw in their mind.
Description (a) has 'hills' but everything else is wrong. Description (b) has 'bushes' but, like (a), paints an entirely different mind picture compared to the target. Description (c) mentions a 'dark ocean' and 'very white sky', which could be taken as resembling (but only slightly) the white snow and rocky mountainside in front. Description (d) cannot be stretched to fit but you might have to choose it anyway if the other targets are even 'further' from it. Description (e) might be stretched if you imagine the mountainside to resemble a jumble of fruit and the snow a tablecloth (an abstract approach). Description (f) appears to be quite a reasonable hit, in terms of structure, except that there is clearly a big difference in scale between a rockery and a mountain.
It is clear that different judges can come up with different decisions about the best match, depending on subjective factors. Does this matter, provided you use the same judge for all such runs? It does if you are trying to obtain some objective measurement of how likely it is the results were obtained by psi. If the experiment was an objective measure of psi it should produce the same result with the same subject, irrespective of who the judge was.
The essence of the problem is that:
- there is typically a lot of information in a picture compared to the description supplied by the subject
- there are variable amounts of information in descriptions by subjects
- the significance of the information in the description can only be judged subjectively
- some judges may accept abstract definitions of words while others may not
- some judges may accept overall appearance (colour etc) as a hit while others may not
- non-matching bits of the description (and picture) are ignored (or should they be 'subtracted'?)
The last point is particularly interesting. If one or two words in a ten word description fit, what happens to the other words? Do we say they don't matter, even though they were clearly intended by the subject to be there and don't match the target picture?
In each of the examples above, the descriptions formed a specific, coherent 'mind picture', none of which matched the actual target. If asked to draw a picture from the word description alone, most people would come up with similar drawings, and none would match the target photo above. Only by selecting certain elements, sometimes in an abstract visual sense (like 'white table cloth' to match 'snowy hillside'), can any of the descriptions be considered a match. It is clear from the descriptions that none of the subjects actually 'saw' a hillside with snow, rocks and bushes in their mind when they were answering.
What if none of the five pictures can be matched up with anything in the description from the subject? The protocol says you still have to choose one of the pictures as a match. If that judgment turns out, by random chance, to be a hit (remember the judge doesn't know which picture is the target) then the subject gets awarded a hit for what is an obvious miss.
What are the real odds?
Earlier I said there was a 1 in 5 chance of guessing the correct picture. But is that true? The subject has no idea what ANY of the original pool of pictures contained at the start of the experiment. The 'sender' only sees and 'transmits' the selected target picture, unaware of the contents of the original pool of pictures.
So, if you assume the subject IS using psi and can 'tune in' to the picture being sent to them, where exactly do the other unused pictures fit in? What purpose do they serve in the experiment as described? All they appear to do is to give an illusion that we somehow know the odds of the subject getting a description of the target picture by chance.
So, do the 1 in 5 odds play a part in judging? By chance you might expect the subject to guess the target 1 time in 5 in a pool of 5 'random' pictures. However, that only works if the pictures are all completely distinct from each other. In a set of five pictures it is highly likely that two or more will share some similarities, given the large amount of information in most pictures. So the odds of getting a hit by chance are highly unlikely to be exactly 1 in 5. It will, in fact, be extremely difficult to say what the true odds are unless all the photos are selected to be completely different.
In reality, what we are doing in this experiment is saying, can a subject get any elements about a target picture right? In fact, it is not even that. We are saying, can a subject get any elements of a target right, when compared to four other random pictures?
Working out the odds for getting an element correct in a picture, compared to a pool, depends on many factors, some difficult or impossible to quantify. Factors include:
- a picture contains much more information than a word description
- the complexity of the image in the pictures
- how closely the pictures resemble stereotypes
- how different the pictures in the pool are from each other
The point about stereotypes is interesting. There are certain stereotype shapes that people tend to draw if asked to make a picture (eg. houses, boats). Any picture containing such elements is likely to get more hits by sheer chance.
This type of experiment attempts to introduce 'forced choice' certainty with 'free response' user friendliness. However, saying the odds are 1 in 5, which sounds reassuringly precise, is an illusion in this setup. We have no way of knowing the real odds of getting hits by random chance.
Can the process be fixed?
What could we do to improve the experiment?
Firstly, you could change the judgment process. Instead of saying one picture must be a hit, you could ask judges to allocate scores of 1 to 5, say. Indeed, you might go from -5 to +5 if you want to take into account the amount of incorrect information. However, such a scale would be highly subjective and arbitrary (even if you lay down lots of rules beforehand). You need to decide such matters as: how many ways are there to be 'wrong' about the contents of a picture? Which elements of a picture are central to its identity and which only minor? Do you limit the number of words allowed in the description to keep them comparable? What about different subjects' vocabulary and general knowledge, which could affect their descriptions of the picture? In addition, by abandoning forced selection, the illusory '1 in 5' odds also go out of the window (you could end up with a negative score), making the maths harder.
What about changing the judges themselves? You could get the subject to act as judge and let them decide which picture best fits what they saw in their head. However, though this may make matching more accurate, it still leaves the process subjective. Employing many judges for every experiment might smooth out the subjective element, but it couldn't entirely remove it. It also cannot fix the basic problem that the '1 in 5' odds are wrong to start with.
So why not show the subject all the pictures in the pool to start with and get them to simply say which one is being sent? Which brings us back to Zener cards!*
In our fictional experiment it is clear that there is a major subjective element, in the judging, where people's personal rules for judging can bias results. Even if judges were removed and a set of predetermined rules used (eg. only a set number of words allowed to describe pictures or abstract interpretations not allowed, etc), this is still subjective. It merely moves the subjective element to the planning part of the experiment. A different person may draw up a different set of rules in the same circumstances. Again, this means the result depends not just on the subject but the experimenter.
The illusion of precision
At the end of our fictional 'picture experiment', we plug in the results we have obtained to grind out statistics. The idea is to work out the odds against arriving at the result by random chance. However, with the various subjective biases in the experiment, as well as the incorrect chance odds to start with, the result is obviously never going to be valid.
Doing the statistics is a purely mechanical process. Once the initial results are plugged in, the conclusions are inevitable, barring calculation errors. However, if the initial results are biased (as they must be in this experiment) then so is the final conclusion. The stats may look impressive, but they are still biased.
One might speculate that such a subjective bias, where the personnel designing or doing the experiment can affect its result, could even produce an experimenter effect.
Real life experiments
The example of forced selection picture guessing above is not something used in real life modern parapsychology experiments. However, it demonstrates the very real problems of measuring the paranormal. It is difficult to eliminate subjective biases and to know the real odds against random chance producing a particular result.
What can be done to 'get beyond Zener cards' without introducing subjective biases and incalculable odds? Firstly, it is important to make the process of recording hits and setting targets as mechanical as possible. Targets need to have a limited set of possible values from the start. This can be disguised from the subject to make the test more interesting. So, for instance, you can try to 'beat' a random number generator by disguising it as a computer video game.
If it is impossible to calculate odds or remove subjective biases, you could use a control. This means inserting random data (or a non-psychic person) into the same experimental process as the subject. The control data (or person) should give you the odds of random chance for your design. The subject then needs to beat those odds.
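As a sketch of the idea (the function name and figures are our own invention), the subject's hit rate can be compared against a control's with a simple two-proportion z statistic; a subject genuinely 'beating' the control should push z well above the couple-of-standard-errors region that chance alone can reach:

```python
from math import sqrt

def z_score(subject_hits, control_hits, n):
    """Two-proportion z statistic: how far apart the subject's and the
    control's hit rates are, measured in (pooled) standard errors."""
    ps, pc = subject_hits / n, control_hits / n
    pool = (subject_hits + control_hits) / (2 * n)
    se = sqrt(2 * pool * (1 - pool) / n)
    return (ps - pc) / se

print(z_score(200, 200, 1000))           # identical scores give 0.0
print(round(z_score(250, 200, 1000), 2)) # subject ahead: positive z
```

The attraction of this design is that the chance baseline is measured, not calculated, so it works even when the true odds are unknowable on paper.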
To summarise: firstly, it is important, though difficult, to eliminate subjective elements from parapsychological experiments. These subjective elements include not only using judges to rule on results but also setting arbitrary rules beforehand for a more 'mechanical' approach. If changing who devises the rules for an experiment, or who conducts it, makes a difference to the results, there must be a subjective bias present.
Secondly, it is vital that the real odds of getting a hit by chance are known. Otherwise, applying statistical methods to the results merely gives an illusion of hard data.
One could argue that other sciences, like psychology, include subjective elements. However, they are not trying to detect psi, a novel extra-sensory channel for information for which there is no hard evidence. We need a higher standard of evidence to demonstrate such a controversial phenomenon.
Experimental design for psi testing is not at all simple. However, with a bit of ingenuity it is possible to eliminate subjective elements while keeping the design emotionally stimulating. It is usually a case of starting with a simple design and then getting other people to criticise it until a better version can be made. And then the process is repeated again and again to iron out all subjective biases and experimental flaws. It can be a painful process but it is well worth the effort.
*ASSAP did an experiment once where each pool of possible target pictures showed a particular kind of building. All the photos were different (unlike Zener cards) but the subjects knew they could only select from 1 of 5 types of building. Any information other than the building type was ignored. This variability of pictures may have made the experiment more emotive while still keeping judging objective and the odds known.
PS: A disturbing speculation
Let us suppose, for the sake of argument, that there was no psi effect. What sort of results would we expect from parapsychological experiments overall? You might think that they would all show 'chance' results ie. no effect outside random odds. However, looking at the type of 'thought experiment' above, this is unlikely. Instead, there would be small deviations from chance due to (a) small sample sizes in some experiments, (b) incorrect calculations of odds and (c) bias unintentionally designed into the judging process.
Further, it is likely that experiments designed by those who believed in psi would tend to show more favourable results than those designed by people who did not. This is because of a slight unintentional bias in the judging process. That doesn't mean the judges are biased (double blinding could eliminate that) but the design of the judgment process could subtly favour, quite unintentionally, a positive outcome for those expecting one. Disbelievers could, similarly, unintentionally design in biases towards negative results. This would explain the experimenter effect. Since studies done by believers in psi tend to outnumber those by disbelievers, a small positive bias would emerge overall.
Another thing you might expect is that as more experiments were done, the overall sum of all the results would get closer and closer to chance. This is because the larger the sample overall, the smaller the deviation from chance, even allowing for biases in some results. If psi is real, this should not happen.
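This convergence is just the law of large numbers, and is easy to see in a quick simulation of a hypothetical pure-chance guesser (our own sketch; the names are illustrative):

```python
import random

def chance_deviation(n, p=0.2, rng=random):
    """Simulate n pure-chance guesses at hit rate p and return how far
    the observed hit rate strays from p."""
    hits = sum(rng.random() < p for _ in range(n))
    return abs(hits / n - p)

rng = random.Random(42)
# The deviation from the 20% chance rate tends to shrink as the
# number of guesses grows.
for n in (10, 100, 10000):
    print(n, round(chance_deviation(n, rng=rng), 3))
```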
If psi is not real, you could also expect long term unrepeatability of psi in experiments as well as the tendency for positive results to decline with repetition.
What is disturbing is that meta-analyses, statistically summing the results of many psi experiments, reveal pretty much that result - an overall slight positive effect. That is for 'forced choice' experiments (some of which can, as seen above, still have problems with odds calculations and judging bias). The results for 'free response' experiments show a much larger positive effect. Given the problems demonstrated in the thought experiment above, that is hardly surprising and in no way reassuring.
Of course, this is only speculation. Real life parapsychology experiments are usually much better designed than the thought experiment above. So a proper study would need to be done (a) to test for 'odds' and 'judgment' biases in real life experiments, (b) to look at the ratio of believer/non-believer experiments and their relative contribution and (c) to predict what sort of bias might be expected if no psi existed.
Also, there may be other reasons to explain such things as non-repeatability, declining results, experimenter effect and so on in psi. The paranormal may be a lot more peculiar than even we imagine.
PPS: The lab / field gap!
As mentioned above, overall results of many psi experiments show a slight positive result. If we take this at face value it would imply that the paranormal should be very rare and unspectacular. Of course, that could mean that the paranormal manifests as very rare spectacular displays (like hauntings) rather than as a continuous slight effect. Nevertheless, it seems difficult to reconcile such a slight effect with the spectacular field phenomena, like ghosts, that prompted the lab experiments in the first place.
It is possible that people have simply assumed that psi is related to field paranormal phenomena when it isn't. Psi might exist but have nothing to do with ghosts! One way to reconcile this apparent contrast is that most apparently paranormal events turn out, on investigation, to be xenonormal. This leaves one remaining mystery: if most paranormal events are xenonormal, is psi even involved in the remaining unexplained portion?
PPPS: How to get real odds - the ideal!
In order to calculate the absolute odds in a paranormal experiment, the following conditions need to apply as a minimum (a 'go' is a single attempt to guess a target - most experiments involve many such identical attempts):
- a fixed set of possible outcomes to any 'go' ('forced choice')
- an equal chance of all possible outcomes in every 'go'
- the subject should be able to distinguish beforehand between every possible outcome
- the result of any particular 'go' should not bias the subject's future responses
The design of targets is crucial to any such experiment. A 'target' here is one of several equally likely 'outcomes' (such as a set of numbers from 1 to 5). It is vital that these outcomes are readily distinguishable by the subject. Indeed, it would be a good idea, if practical, to do a 'dry run' showing all possible outcomes to the subject in advance, to ensure they understand the differences and give a distinct response to each. It is possible that certain individual subjects may be unable to distinguish between particular outcomes, such as colours or shapes, in the way the experimenter intends. Therefore subjects should be screened beforehand to ensure they can each distinguish between all possible outcomes.
The random choice element MUST apply when the subject is making a choice. It is no use if it is applied beforehand, in the design of the experiment, or afterwards, when it is judged. Unless the subject is faced with a random choice at the time of their 'go', they are not actually being tested!
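A sketch of a single 'go' that satisfies this condition (our own illustration, with invented names): the target is drawn, uniformly at random, only at the moment the subject commits to a guess:

```python
import random

def forced_choice_go(guess, outcomes=5, rng=random):
    """One 'go': the target is drawn at random *at the moment of the
    guess*, from a fixed set of equally likely outcomes 0..outcomes-1."""
    target = rng.randrange(outcomes)
    return guess == target

# Over many goes, a non-psychic guesser hits about 1 time in 5.
rng = random.Random(1)
hits = sum(forced_choice_go(0, rng=rng) for _ in range(10000))
print(hits / 10000)
```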
In some experimental designs, instant feedback is provided to the subject. This could bias future responses and so should, ideally, be avoided.
It could be said that, by repeating experiments many times, statistics can iron out any slight biases introduced by ignoring these ideals. However, if a bias is systematic, operating continually in one direction, it will not be removed no matter how many times you repeat the experiment. It is better to remove it from the design altogether.
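A quick simulation makes the point (the +2% judging bias is a hypothetical figure of our own): with more trials, the observed hit rate converges ever more tightly on the biased value, not back towards the true chance rate:

```python
import random

def observed_rate(n, chance=0.2, bias=0.02, rng=random):
    """Pure-chance guessing, but the judging procedure systematically
    awards extra hits (a hypothetical +2% bias). Repetition sharpens
    the estimate around 0.22 -- it never drifts back to 0.20."""
    hits = sum(rng.random() < chance + bias for _ in range(n))
    return hits / n

rng = random.Random(7)
for n in (100, 10000, 200000):
    print(n, round(observed_rate(n, rng=rng), 3))
```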
The ideal could be summed up as: the subject always faces an equal, random chance of a fixed number of different outcomes that they can readily distinguish from each other. The further the design moves away from this ideal, the greater the chance of introducing unintentional bias.
© Maurice Townsend 2009