Ouroboros: Objectivity of the subjective process of interpretation
Ouroboros is the ancient serpent that bites its own tail, representing reflexivity and cyclicality. It also represents the dilemma that the humanities face in the era of big data, and the fundamental problem of grounding humanities data. More data suggests more evidence, but data in the humanities seems to be slightly different from data in other scientific disciplines. First, the human mind and its interpretations are extremely complex phenomena. Second, the creators and the intended audience of art, text, or other cultural products may no longer be available. Third, the interpreter is part of the equation.
Interpretation is subjective. This is hardly a topic for discussion in the humanities. We interpret signals, situations, and information as personal experiences. We have different motivations, moods, and backgrounds when we interpret signals, even when interpreting the same signal during different observations. In the humanities, subjective interpretation is sometimes even a privilege: reading a novel over and over again will be a different personal experience each time you read it. Less clear, however, is what this subjectivity implies. To what extent are all our observations and interpretations different and therefore subjective? Does subjective mean the same as different, or does it mean impossible to control and compare, or just complex? How different are our interpretations, or how similar or universal are they?
Science and interpretation
Empirical scientists believe that methodological discipline and rigor can eliminate subjectivity, but only if many methodological pitfalls are avoided. One can debate whether it is actually possible to avoid all pitfalls and still be empirical, since you have to control for so many variables. Sampling is a statistical method that approximates reality based on probability models, but sampling is doomed to fail if the important variables are not (yet) known, which is often the case in our complex reality and our perception of it. From the perspective of empirical science, subjectivity and interpretation are problems that stand in the way of scientific progress. However, in the case of the humanities, psychology, and sociology, subjectivity is not a bug but a feature. The process and result of interpretation tell us something about who we are and what we are.
This adds a dimension to interpretation. Whereas other sciences are mainly concerned with what signals tell us about the world, in the humanities it matters what the signal means for the observer, or that it means different things for different observers. Studying interpretations of signals or personally interpreting signals? Observing the world or observing observers in the world? These are different scientific objectives. If we limit the discussion to language, the difference can be described as one between, on the one hand, scientists who are interested in the knowledge and information expressed by textual sources, such as historians and social scientists, and, on the other hand, scientists who study the interpretation process itself, regardless of the content. The former want to learn about the world, for which they use textual sources, among other things. The latter want to learn about the interpretation process. For the former, it is a problem when observers come to different conclusions using the same text. For the latter, different interpretations by the same individual or across individuals are an empirical fact and at most a challenging and interesting phenomenon. When the interpretation of data is the object of study, it no longer matters which interpretation is correct or wrong. Any interpretation is as good as any other, and the more the interpretations differ, the more data we have to study the phenomenon. Validity of interpretation is not at stake here.
Making interpretations measurable through annotation
I am interested in interpretation as a semiotic process and specifically the interpretation of language. My ultimate challenge is to build machines that can interpret language in the same way as humans; machines that mimic human behavior. Precisely because machines have no built-in subjectivity, simulating human interpretations means that subjectivity itself also needs to be modelled explicitly. Subjectivity needs to be part of the equation. This is different from saying that subjectivity makes explicit modelling impossible. I believe it is insightful to model subjectivity in an explicit manner by including it in the data model, trying to approximate objectivity of subjectivity.
One way of doing this is through annotations of interpretations. Annotation can be seen as an attempt to make the interpretation explicit and partially replicable. At least we can observe who came to what interpretation and how these annotations agree or differ. Furthermore, annotations can be given to machines. Typically, machines can mimic human interpretations by associating symbols or signals with annotated interpretations of these symbols. The annotation process itself, and the capacity of machines to mimic it, shed some interesting light on the dilemma of interpretation and how to study it. Machines can annotate massive amounts of data and we can observe the effect of this annotation, learn from it, and adapt it if flawed. We can endlessly repeat this process and build numerous variations of models to compare, i.e. train different machines based on different human annotations.
We can see annotation as a process to formally encode the interpretation in a database. People can annotate any type of signal; not just text. In the case of humanities, these signals can be archaeological findings, paintings, sculptures, dresses, old texts, but also any communication in the modern digital world (e.g. tweets in social media).
For each signal, the annotation needs to determine the so-called extent and the label that applies. The extent is the portion or aspect of the signal that somebody considers or is looking at. The label represents the interpretation of the extent. In the case of a painting the extent can be many things: the material, the style, the painter, the age, the state it is in, the owner, the depicted objects, the scenery, the symbolic value, the dress of the person, the position of the hands, or a certain area of the painting defined in terms of pixels. In the case of text, the extent can be the text as a whole, a paragraph, a sentence, a range of words, an expression, a single word or part of a word, or the offset position in a scanned image of a text. The labels that people assign to some extent represent their interpretation. In some cases, these labels are taken from a limited set of values (e.g. the dimensions of a painting, the part of speech of a word, a set of basic emotion labels), but they may also be open, in which case people describe their interpretation freely in words. Obviously, the annotation label itself (e.g. free text) is another signal that is intended to carry some implicit meaning. These annotation signals do not fully describe the meaning of the annotated signal either. They are problematic in three ways. Firstly, any description in any language explaining an annotation, regardless of its length, is by itself just another attempt to communicate or ‘translate’ an experience or interpretation that is in its essence personal; this is what Quine (1960) called the indeterminacy of translation. Secondly, annotators often do not clearly define the extent that they interpret, that is, what they are focusing on. Thirdly, the interpretation is only valid within its specific socio-historic context. If the context is not annotated and not represented, the annotation of some extent can be ambiguous or interpreted wrongly.
Finally, the annotator, as an individual, should be defined as a variable in the annotation database; one should know who made the annotation, i.e. its provenance. Knowledge about the annotator may help in understanding the annotation.
Since the annotation, however detailed, is always a proxy of the interpretation that is cast in some language, studying the annotations is not trivial either. Anybody searching a database of paintings needs to think of words to match the labels presumably assigned by other people. Whenever we find data with matching labels, we can never be sure whether our interpretation of the data matches the interpretation of the annotator.
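The ingredients named above (signal, extent, label, context, and annotator) can be written down as a small data model. The following is only a minimal sketch in Python; all class names, field names, and example values are illustrative assumptions and do not describe an existing annotation scheme or database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extent:
    """The portion or aspect of a signal that the annotator is looking at."""
    signal_id: str               # which text, painting, tweet, ... (hypothetical id)
    kind: str                    # e.g. "token_range", "pixel_region", "whole"
    start: Optional[int] = None  # start offset, for text extents
    end: Optional[int] = None    # exclusive end offset, for text extents


@dataclass(frozen=True)
class Annotation:
    """One interpretation: who assigned which label to which extent."""
    extent: Extent
    label: str                     # closed-set value or free-text description
    annotator_id: str              # provenance: who made the annotation
    annotator_type: str            # e.g. "domain_expert", "linguist", "crowd"
    context: Optional[str] = None  # socio-historic context, if it was recorded


def labels_by_annotator(annotations, extent):
    """Collect the label that each annotator assigned to exactly this extent."""
    return {a.annotator_id: a.label for a in annotations if a.extent == extent}


# Two annotators interpret the same range of characters differently.
sentence = Extent("doc-1", "token_range", 0, 12)
annotations = [
    Annotation(sentence, "joy", "ann-A", "linguist"),
    Annotation(sentence, "relief", "ann-B", "crowd"),
]
print(labels_by_annotator(annotations, sentence))
```

Keeping the annotator and the context as explicit fields is what makes it possible to treat subjectivity as a variable in later analyses rather than as noise.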
What can we then say about the value of the annotation itself? Annotations can be done by experts and by what is called ‘the crowd’ in the digital world: an anonymous work force that is willing to carry out simple tasks for a little money. Experts come in two flavors. They can be domain experts, e.g. doctors, lawyers, biologists, political analysts, or financial experts, with professional knowledge about the topic of a text; or they can be linguistic experts, who may know little about the world the text describes but know everything about linguistic structures, properties, and meaning. Domain experts tend to rely on their background knowledge and fill in information that is not directly reflected in the signal itself; hence it is not just the signal that is observed, but their background knowledge adds what is not represented. In addition, domain experts are often not very precise in attaching interpretations to the actual signal or to the portion of the signal that we have called the extent. Although their interpretation of the message may be more precise than that of non-experts, this data is less useful for training a machine to simulate the human interpretation process when reading language or observing a signal. A machine that needs to learn to assign a diagnosis to a medical description needs to connect the labels to, e.g., specific words or an image. Whenever it encounters a new, unseen description or image, it will compare it with whatever it observed before. On the basis of the degree of matching with its past experience, it will pick the most likely label according to a statistical model. The knowledge of the machine is shallow; the knowledge of the expert is deeper, involving background knowledge.
On the other hand, restricting ourselves to text, linguistic experts tend to stick to the structure and expressions of the text and assign their interpretation to specific words or ranges of words. Lacking deep domain knowledge, they necessarily rely on abstract interpretations and on the compositional effect of the syntactic structure that combines words. A downside of annotation by linguists is that they see meaning in each token while having little background knowledge. Their annotations tend to be rather general and abstract, and tend to be assigned to portions of text that are interesting for a linguist but not relevant according to a domain expert. A machine trained on this annotation will behave differently.
The quality of the annotation and, hence, the meaningfulness of the annotation, is usually measured by having more than one expert annotate the same data. From this, the inter-annotator agreement (IAA) is calculated. If the Kappa score (see note 1) of the annotation is high (typically above 0.8, or 80 when expressed on a scale from 0 to 100), we can infer a number of things: the annotators are reliable and the annotation task is doable. In a sense, we can infer that the interpretation process is reproducible across the group of annotators. The Kappa score then also more or less defines the upper bound for a machine. We cannot expect the machine to score better than the humans that trained it. Note that inconsistencies can occur in the labels selected but also in the extent selected by the annotators. In the latter case, we can allow for partial overlap of the extent. We can also allow for more than one label to apply to the same signal extent, or allow for hierarchical labels (e.g. expressing an emotion or a specific type of emotion), and so make the system more robust. We typically see that extremely fine-grained labels and precise definitions of the extent lead to low IAA scores and effectively make a task undoable. We can see this as a measurable limit with respect to the level at which we can share meaning and interpretation of signals.
Recently, crowd annotation of data has become very popular, not only because it is cheaper and faster but also because it provides another view on the interpretation process itself. In the case of crowd annotation, a task such as labelling text is offered to the public, the crowd, for very little money. Instead of getting annotations from a few experts, researchers typically collect many (hundreds of) annotations from the crowd. The response from the crowd is analyzed to find the median response and the deviation, and from these statistics a confidence for the most likely interpretation is derived. The crowd is neither an expert in the domain nor a linguistic expert. However, if the task is defined in a simple and understandable way, it has been shown that non-experts can annotate data at least as well as experts. More importantly, their annotations (read: interpretations) provide more natural data, in a sense. Assuming that any native speaker of a language can at least have a basic understanding of the content, the crowd interpretation is less biased by background information and therefore more bound to what is in the text, while not being over-sensitive to linguistic structures.
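To make the agreement measure concrete, Cohen's kappa for two annotators can be computed as (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance. The sketch below assumes that both annotators labelled the same items in the same order and that extents already match; the function name and the example labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty list of items")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label independently,
    # estimated from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_observed - p_chance) / (1 - p_chance)


annotator_a = ["pos", "pos", "neg", "neg"]
annotator_b = ["pos", "pos", "neg", "pos"]
print(cohens_kappa(annotator_a, annotator_b))  # 0.5
```

In this example the annotators agree on three of four items (p_o = 0.75), but half of that agreement is expected by chance (p_e = 0.5), so kappa is 0.5, well below the 0.8 level mentioned above.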
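The crowd statistics described above can be sketched in a few lines. This is only an illustration under stated assumptions: numeric ratings are summarized by median and standard deviation, while for categorical labels the most frequent label stands in for the median and its share of the responses serves as a simple confidence. Both function names are hypothetical, and real crowd-sourcing pipelines may weight annotators differently.

```python
import statistics
from collections import Counter


def summarize_ratings(ratings):
    """Median and (population) standard deviation of numeric crowd ratings."""
    return statistics.median(ratings), statistics.pstdev(ratings)


def aggregate_labels(labels):
    """Most frequent categorical label and the share of the crowd that chose it.

    Ties are resolved in favor of the label that was seen first.
    """
    if not labels:
        raise ValueError("need at least one crowd response")
    top_label, top_count = Counter(labels).most_common(1)[0]
    return top_label, top_count / len(labels)


print(summarize_ratings([1, 2, 2, 3, 5]))
print(aggregate_labels(["joy", "joy", "anger", "joy"]))  # ('joy', 0.75)
```

A low share or a large deviation does not have to be treated as noise; it can itself be recorded as evidence that the signal allows more than one interpretation.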
Finally, you are also an annotator since you are an observer. People communicate and blog about their experiences. They spread their opinions and interpretations over social media. Although this annotation is not guided and deliberately orchestrated, its massive scale provides other and new opportunities to obtain data on how we perceive the world, whether this is a book or movie, a hotel, a political situation, or some event.
Whether collected from experts (domain and linguistic experts), from the crowd, or from your blog, the result is a collection of observations for which the interpretation is made explicit through labels that are stored as additional data (with varying quality). Machine learning techniques represent the original signal in terms of structural features or properties, for example, the words in the text or the pixels in a painting, and associate the labels with the features. Unseen data, which are signals not seen before by the machine, are represented using the same features. By comparing the features through some mathematical model, the machine decides which examples from the training set come closest to the unseen data, and which labels are therefore most likely to apply. This simple associative approach has been very successful in approximating human labelling for some tasks. The fact that a machine can mimic the labelling does not mean that the machine understands the signal in the same way as humans do, as argued by Searle (1999) in his Chinese Room thought experiment. However, it does imply that the association between the signal, or some extent of the signal, and the label is strong enough to distinguish it from other signals. In conclusion, we can say that annotation as a method and process is not trivial, but it is the best we have. Combining interpretation data through triangulation is a path that we need to explore, where we should try to differentiate data as much as possible by including properties of the observers or annotators in the data as variables, but also strive for deeper and richer interpretations in the future.
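The associative approach described above can be illustrated with a deliberately simple nearest-neighbor sketch: each text is represented by its bag of words, and an unseen text receives the label of the most similar training example. The choice of cosine similarity, the function names, and the toy training data are assumptions made for illustration; actual systems use richer features and statistical models.

```python
import math
from collections import Counter


def featurize(text):
    """Represent a text by its bag of lowercased words."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two bags of words."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def predict(training, text):
    """Return the label of the training example that is most similar to `text`."""
    features = featurize(text)
    _, best_label = max(training, key=lambda pair: cosine(featurize(pair[0]), features))
    return best_label


# Toy annotated data: (signal, label) pairs.
training = [
    ("the hotel was wonderful and clean", "positive"),
    ("the room was dirty and noisy", "negative"),
]
print(predict(training, "a wonderful clean room"))  # positive
print(predict(training, "dirty noisy hotel"))       # negative
```

The sketch only matches surface features with past examples; nothing in it involves the background knowledge that a human annotator brings to the same text.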
Beyond annotation of interpretation
We are all observers. We focus our attention, perceive signals, and establish meaning. What ‘meaning’ is precisely is as yet impossible to define fully. It is an interpretation effect in the brain, where each brain holds a unique collection of data and experience. What is meaning across different human observers? Is this in any way different from the difference between humans and computers?
We can observe the behavior of observers, assuming they interpret the same signal. For example, we already massively register people observing products and whether they buy them or not. But how many measurements are needed to be able to predict the effect of signals? Even if this can be approximated through models that couple signals to other data and predict the behavior correctly, this is still indirect evidence, and there is no proof that the behavior is the result of the same process. If I decide to buy a product after seeing a commercial, it does not mean I have the same motivation or experience as somebody else who buys it.
Nowadays we can also measure signals in the brain while it functions. It is still difficult to measure the brain in natural conditions (you need to pack your head with wired sensors or sit still in a machine) and the granularity of the measured signal is very coarse. Nevertheless, it is amazing to see that our conceptual map, regardless of how sketchy it is now, seems to be at least roughly universal. Mitchell et al. (2008) demonstrated that systems trained with brain activity measures can predict the activation pattern of other people when observing the same word as a stimulus, but also that words outside the training data roughly map to regions close to similar words from the training set. We have a semantic brain and we share, at least to some extent, the semantic space within our brains. However, showing that activation patterns evoked by single words in experimental setups map across participants and across semantically similar words is still far from mapping complete experiences across brains. Reading the brain and knowing what we think is still not within our reach given the current technology.
We hope that scientific methodology and transparency of experiment guarantee, to some extent, the logical sanity of the observations and interpretations, whichever way we collect the data. Replication and reproducibility of the interpretation are key concepts in this discussion. However complex the interpretation process is, such as reading a novel or experiencing art, making the interpretation explicit through some form of annotation or registration will make it possible to compare one interpretation with another or one effect with another. The route to be followed is not to simplify or reduce the interpretation because of its complexity and the amount of work required to annotate data, but to control the process of interpretation as much as possible and make the result explicit. Again, the validity of annotation is not at stake, just as the validity of interpretation is not at stake when we study the phenomenon of interpretation. Any annotation is as valid as any other, and the more annotations we have the better, as long as the annotation is a true proxy of the interpretation and not a deliberately false one. In this sense, it is interesting to see how so-called ‘spammers’ (computer programs or people that do not take the task seriously) are eliminated from crowd-annotation tasks by analyzing the annotation behavior and removing unnatural outliers.
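One simple way to operationalize the removal of unnatural outliers is to flag annotators who almost never agree with the majority label per item. The sketch below is an assumption about how such a filter could look, not a description of what crowd-sourcing platforms actually do; the function name, the triple format, and the threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def find_spammers(responses, min_agreement=0.5):
    """Flag annotators who rarely agree with the per-item majority label.

    responses: list of (annotator_id, item_id, label) triples.
    Returns the set of annotator ids whose agreement rate is below min_agreement.
    """
    labels_per_item = defaultdict(list)
    for _, item, label in responses:
        labels_per_item[item].append(label)
    # Majority label per item (ties go to the label seen first).
    majority = {item: Counter(labels).most_common(1)[0][0]
                for item, labels in labels_per_item.items()}
    hits, totals = Counter(), Counter()
    for annotator, item, label in responses:
        totals[annotator] += 1
        hits[annotator] += int(label == majority[item])
    return {a for a in totals if hits[a] / totals[a] < min_agreement}


responses = [
    ("A", "i1", "pos"), ("B", "i1", "pos"), ("C", "i1", "neg"),
    ("A", "i2", "neg"), ("B", "i2", "neg"), ("C", "i2", "pos"),
    ("A", "i3", "pos"), ("B", "i3", "pos"), ("C", "i3", "neg"),
]
print(find_spammers(responses))  # {'C'}
```

Such a threshold has to be set with care: an annotator who consistently deviates from the majority may be a spammer, but may also hold a genuine minority interpretation, which is exactly the kind of data the study of interpretation is interested in.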
- Mitchell, T.M., Shinkareva, S.V., Carlson, A., Chang, K.-M., Malave, V.L., Mason, R.A. & Just, M.A. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320(5880), 1191-1195.
- Quine, W.V.O. (1960). Word and Object. Cambridge, MA: MIT Press.
- Searle, J. (1999). The Chinese Room. In R.A. Wilson & F. Keil (Eds.), The MIT Encyclopedia of the Cognitive Sciences (pp. 115-116). Cambridge, MA: MIT Press.
- 1. Cohen’s kappa coefficient is a measure of inter-rater agreement. It compares the number of cases in which people agree and disagree, but also takes into account the number of choices people have and the agreement that would be expected by chance. The more labels are allowed, the lower the chance of agreement.
© 2009-2020 Uitgeverij Boom Amsterdam