Friday, April 13, 2018

Do Testers Need Bugs?

At this week's Cambridge Tester Meetup we played Questions for Testers, a card game created by James Lyndsay which is intended to "trigger conversations and build connections."

The deck consists of cards containing questions or statements with three responses. We took it in turns to read out a question or statement and the others quizzed us to help them decide which response they thought we'd give. Eventually they'd guess at our response, and we'd reveal it, and then talk about why we'd chosen as we did.

Stefan's choice was the one at the top: 
Bugs and testers are like ...
A. Ants and aardvarks
B. Bees and beekeepers
C. Cars and Cops
Questions, did you say? Boom! Head explosion.
  • Are we mapping bugs and testers to one of the entities in each response? 
  • Does the order of the entities matter?
  • Do they each map to just one? Could they each map to both?
  • What relationships might motivate that mapping? 
  • Does it need to be the same mapping for bugs as testers for a given response?
  • What properties do the entities in each response have?
  • What relationships do we perceive between the entities in each response?
  • What analogous relationship might there be between bugs and testers
  • ... or bugs and testers interpreted however we've mapped them to the entities in that response?
  • When choosing a response, am I evaluating all responses in the same way?
  • What way is that?
  • How is it "the same"? 
  • Does it need to be "the same"?
  • Who says so?
  • In what sense is it an evaluation anyway?
  •  ...

It was fun listening to the others ask questions and wondering what kind of hypotheses they were testing with them.

It was interesting that we didn't much discuss what we thought we were discovering from the responses given, even though there's no reason why we shouldn't have. (But the game isn't set up as a collaborative effort in that way: the rules award points to individual players, although they also say the points are pointless, and we weren't following the rules anyway.)

We found that asking how the reader would be approaching their answer was a very productive question: it can cut through several layers of assumption on the questioners' side.

There were lots of good questions, but my favourite was one of a series intended to test the kinds of relationships that Stefan perceived between bugs and testers:
Do testers need bugs?
Do. Testers. Need. Bugs? Boom!

Saturday, April 7, 2018

Testing in the Abstract, Again


It reminded me once more of Harnessed Tester asking whether he could switch his testing off.  I mean, I never intended to start testing when I began reviewing those abstracts. All I wanted to do was think about a strategy for reviewing conference submissions, implement it, and reflect on it. That's all. Honestly. But here I was, testing. Again.

I'd been asked to review submissions for a conference. On the one hand it was a tiny bit flattering to be asked, I was certainly interested to see what goes on behind the scenes, and I thought I might learn something about submitting proposals of my own in future.

On the other hand, I'm not naive enough to think that I was being asked because of who I am rather than because they needed eyeballs: a large number of reviews was required, and the deadline was just four days away. But I decided to go for it, on the basis that it was a novel experience for me.

Limited time, a new task, little context: not unusual parameters for a project. I carved out a bit of time for preliminary thoughts on how I wanted to begin attacking the problem, asking things like: what factors could I predict that I might be concerned about, how might I prepare for them, and how might I mitigate any risks that I identified?

I like to reflect on how I worked and how I felt about it, so when I began reviewing I made notes as I went. It wasn't long before I realised that I was noting down observations not just about myself but about the conference review process I was working in, the usability of the reviewing application, and the compromises I was making because of those things.

Testing, you see. Again.

Let's start where I started. What were some of the potential problems I could foresee with reviewing?

  • Calibration of my scoring: Both with myself and with other reviewers. I hoped that the provided guidelines would help me to score consistently (enough for the organisers' needs) with other reviewers, and I hoped I could be consistent with myself across reviews.
  • Not enough time: I hadn't been given any guidance on the amount of time to spend reviewing each submission and didn't have much intuition about it. To get through the set I'd been allocated would take several hours at 5 minutes each, plus any prep time and the time to record my views. I wanted to give all candidates a fair crack of the whip.
  • Too personal a view: I think of myself as a bit of a generalist but, like us all, I've got my preferences. For example, I really enjoy the mental exercise that theoretical talks offer. The meta-er the better-er! I've also been around for a while and done, seen, and read a lot of stuff. I might easily be jaded, or too easily assume that some topic that I'm familiar with and totally over already is not interesting to others.

The conference organisers supplied very readable review guidelines. They stipulated that each review was to include numerical scores in several dimensions (including whether I'd want to see the talk myself, how well it fitted the theme etc) and some free text comments to include both reaction and reflection. I found it helpful that there were examples of factors that would motivate both high and low scores.

My first impressions of the web application for managing reviews were less warm. It was a single page with submissions identified only by numbers. Clicking through showed the proposal text and some other data, space for my review and a "submit review" button. There was no indication that I could submit and then edit a review if I wanted to. This bothered me with respect to being self-consistent across the whole set.

When viewing a submission I was unable to see fields for the submitter's name or organisation. I think this is right, and fair.  However, I found it was often trivial to identify the submitter, or that their name was just in plain sight in some other field. Some submissions that I reviewed had additional material attached which gave contact details, and a field which asked for evidence of speaking experience frequently contained links to YouTube or blogs.

As the review guidelines made no mention of assessing speaker suitability or perceived ability, much of the potential for conflict could have been removed by showing reviewers only the title and text of the abstracts.

To give myself a chance to tune my reviewing radar, I decided to pre-review a few proposals outside of the web application. I might have done it in the application if I'd been sure I could go back and edit. I randomly chose 15% of the submissions from my list in the hope of seeing something of the range of proposals I might encounter, being more confident that I'd review fairly and consistently across the whole set, and getting an idea of how much time I might need to spend in total.
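For concreteness, here's a minimal sketch of that sampling step, assuming the submissions are nothing more than a list of identifying numbers; the 15% figure is from above, and everything else (the function name, the 40 submissions) is purely illustrative:

  import math
  import random

  def calibration_sample(submission_ids, fraction=0.15, seed=None):
      # Pick a random subset of submissions to pre-review for calibration.
      rng = random.Random(seed)
      size = max(1, math.ceil(len(submission_ids) * fraction))
      return rng.sample(submission_ids, size)

  # e.g. 40 anonymised submissions, identified only by number
  print(calibration_sample(list(range(1, 41))))  # roughly 6 to pre-review first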

It was a nice idea, but it didn't work on this occasion because I found that I was largely unimpressed by all of them. What to do? Worry that my standards were too high? Worry that my tastes were too esoteric? Worry that I wasn't the target audience for this conference? All possible, but instead I decided that I'd accept that I was consistent on this set, and use it as my baseline, a yardstick for what was to come in terms of reviewing standards and also time per review.

Then I started at the top of the list of submissions and began reviewing. After clicking "submit" on the first review, I resolved my question about whether editing was possible: it was. It would have been nice to have known in advance, but ...

It would also have been nice if the text of the submissions hadn't been stripped of formatting so that bullet lists and paragraphs ran into each other in a blob of messiness. Ironically, having found that I could edit my reviews, I also found that, when I did, the app put HTML markup in the text box along with my text which made it harder for me! Hopefully this makes it easier for the reviewers of reviews to read.

As I found myself flicking back and forth frequently between the review guidelines and the scoring panels, I thought how nice it would be to have the guidance right there in the page, in the place I needed it. I was also surprised to be asked to judge workshops given that the review guidelines were specifically for presentations; as the same scoring options were presented, I used the same criteria.

Although I liked the guidelines a lot, and did my best to review according to them, they still left room for subjectivity. I don't think there's much scope for avoiding that so I decided that I was prepared to trust my own preferences, and that they were what the organisers wanted, given that they had asked me.

I felt a surprisingly strong sense of relief when I found a few proposals that just stood head and shoulders above everything else I had read: something with depth, something that I could truthfully say that I would want to go and see on the basis of the short description only. It gave me some confidence that I hadn't been too harsh with my earlier low grades. Some confidence ...

I said at the top that my initial mission here was to review for myself how I'd gone about reviewing and I noted three particular concerns. How do I feel I did against them?

  • Calibration of scoring:  I would have liked to have been able to use the app to compare the set of grades that I had given but there was no obvious way to achieve that. I was discouraged from deeper comparison across reviews because of the interface: having to remember which number represented which abstract in order to click through and see my grading was tiresome. 
  • Not enough time: my preliminary reviews helped to give me reassurance that I could do a fair job in the time I had available. 
  • Too personal a view: I came to think that it's on the organisers to choose the reviewers they think will represent the audience they want to attract in whatever ways are important to them, in the context of the review guidelines.

But, as we've seen, I ended up thinking about more besides. You'll have noticed too, I guess, that there's no mention of my asking any questions of the organisers. In a real world testing task I would have asked the stakeholders for assistance in giving them what they wanted, for clarification. In this particular world, however, with a very tight deadline and the reviews being done over a weekend I didn't have that luxury.

So this is an experience report, based on no prior experience of this kind of process. I wrote it some time ago, and since then I've reviewed conference proposals another couple of times. The reflections here helped me in the task, and the writing helped that reflection.

I've reflected again on whether I want to turn testing off, as Harnessed Tester suggested. And, again, I don't think that I do. I enjoy the feeling of questioning what I'm doing, and why I'm doing it, and why it matters. I also like the practice of keeping that stuff in check while accomplishing my mission, and working out what it's appropriate to report, when, and to who. If I'm honest, I actually smile when I find myself testing, again.
Image: https://flic.kr/p/gdc5W4

I haven't named the conference I'm talking about here because it isn't intended as any particular comment on them.  I shared my longer thoughts, including specific details not included here, with the organisers some time before I published this post.

Thursday, March 15, 2018

Testing For Me




I spoke at UKSTAR 2018 this week, an eight-minute talk in a Storytelling track. This post is a prettied-up version of the notes I made for it, along with some of the slides. The full slide deck is on the Media page.

My story is called The Anatomy of a Definition of Testing. It's not a suspense story though, so I'll give you the definition right up front:
Testing is the pursuit of relevant incongruity.
That is, for me, testing is the pursuit of relevant incongruity. But how did I get there?

Well, the journey started with Explore It! by Elisabeth Hendrickson, a great book about exploratory testing which has, near the beginning, this definition:
Tested = Checked + Explored
It's finessed a little by
Neither checking nor exploring is sufficient on its own
and the idea that testing is to
... interact with the software or system, observe its actual behavior, and compare that to your expectations.
Interestingly, the definition doesn't really play a significant part in the rest of the book, but as I was reading I kept on returning to it, asking myself questions like these:


Eventually I thought I'd ask Elisabeth whether she could help me out (she said she would!) and we exchanged a few emails, which helped me to clarify my thoughts. But during the conversation I began to wonder: was this process, of thinking about testing, itself testing? I mean, I was doing things that were consistent with things I do when testing:


But was I really testing? By Elisabeth's definition, I wasn't sure that I could say I was. But it felt a lot like testing. So I went looking for other definitions and found loads!


And I recognise aspects of all of those in my testing, but none of them capture all that testing is for me. Reflecting further, I remembered a talk that Rikard Edgren gave at EuroSTAR 2015 where he said this beautiful thing:
Testing is simple: you understand what is important and then you test it.
Adam Knight has talked and written about Fractal Exploratory testing, and describes it like this:
as each flaw ... is discovered ... [a] mini exploration will result in a more targeted testing exploration around this feature area
To me, they're both talking about how investigation of one area leads to investigation in a subset of that area, and in a subset of that area. A kind of traversal through the testing space where the actions that are performed at any level are similar. And I recognise that too, but it's not all that testing is. For me.

I tried to draw my testing on a feature. It looked like this.


Sometimes multiple activities feed into another. Sometimes one activity feeds into multiple others. Activities can run in parallel, overlap, be serial. A single activity can have multiple intended or accidental outcomes ...

I tried to draw it again. It looked like this:


A vision of an attempt to somehow keep a lid on the ambiguity, and the unknowns, and the complexity in order to be able to get on and test.

A colleague pointed me at a drawing by John Bach of his testing.


That scribble on the left is not necessarily confusion and chaos, but cycling, cross-checking, confirming, until a course of action that seems propitious can be identified and followed out to the right with occasions when exploration goes deep down. And, naturally, I recognise that in my testing too. But it isn't all that testing is for me.

So, without a definition but with a lot of thoughts about what I wanted from one, I tried to list the factors of testing and then come up with a definition that covered them all.


And I thought about it for a good long time. I mean, really long. And eventually out popped my definition:
Testing is the pursuit of relevant incongruity.
But that's a bit of a mouthful. Let's unpack it a little:


Incongruity: Oxford Dictionaries define this as "not in harmony or keeping with the surroundings". I interpret lack of harmony as a potential problem and lack of keeping as an actual problem, and those two interpretations are interesting and useful in testing.


Pursuit: Again, there are two senses that capture important aspects of testing for me. You can pursue something that you don't know is there and that you may never find, like a dream. Or you can pursue the solution to a problem that you know you have, that's right in front of you.

Why not some other verb? For me, an investigation has already identified something to investigate; finding requires something to be found, and I want to be able to say I tested even if I found no issues; exploration could work, but I don't want my definition to be thought of as a definition of exploratory testing, much as I love it.


Relevant: if this work doesn't matter to anyone, why are we doing it? Whoever that is can help us to understand whether any incongruities we identify are valuable to them, relevant to the project.

So that's my definition:
Testing is the pursuit of relevant incongruity.
Notice that it says nothing about particular techniques or methods, or products, or systems. It exists, deliberately, in a space where it must be applied in context, at a time, for a purpose.

But what do I do with it?

Well, any time I'm working I can ask myself whether what I'm doing is contributing to the pursuit of relevant incongruity. If it is, I'm testing — I still have the question of whether I'm testing the right things at the right time, but that's a different problem for another story.

If I'm not in pursuit of relevant incongruity I can ask whether I should be, and why. Sometimes it's legit to take off your tester hat and do something else, project housekeeping for example, because it needs doing, because you've got the skills, because you can be most efficient, because it's your turn, or whatever. But sometimes it can provoke me into thinking that I'm not doing what I should be.

Which is great, and I buzzed along very nicely using it. And then I heard Michael Bolton say this on the Quality Remarks podcast:
The goal of testing is identifying problems that matter
And I thought "Hello! That feels pretty familiar!" Although "problems" loses the subtlety of "incongruity", I've already said I have some reservations about "finding", and note that he's talking about the "goal" of testing, not testing itself. But still, there's a similar sentiment here, and look how much snappier it is than mine!

So I asked him about it, and he said "many things stay in the mind more easily when they can be expressed concisely and snappily."

Which is true to my experience, and also very useful, because it emphasises his different need. He's a teacher, and he wants to share his description and have his pupils remember it. And that's OK, it keeps me humble: I shouldn't impose my view of testing on others, and other views can be just as valid.

And so that's my definition:
Testing is the pursuit of relevant incongruity.
It's my definition to help me do what I need to do in my context. It's for me. But I hope it was of some interest to you...
Image: Jit Gosai (via Twitter)

Wednesday, March 14, 2018

Pen and Storyteller


The latest in my occasional series of experiments in sketchnoting, this time from UKSTAR 2018. The sketchnote/video thing I did as a promo for my own storytelling talk is still available here.


Friday, March 9, 2018

Decision By Precision?


I2E, the flagship product at Linguamatics, is a text mining engine and so sits in the broad space of search tools such as grep, Ctrl-F, and even Google. In that world, evaluating "how good" or "how relevant", or the "correctness" of a set of search results is interesting for a number of reasons, including:

  • it may be hard to define what those terms mean, in general cases.
  • it may be possible to calculate some kind of metric on well-understood, small, data sets but less so at scale. 
  • it may be possible to calculate some kind of metric for simple searches, but less so for complex ones.
  • on different occasions the person searching may have different intent and needs from the same search.

But today we'll concentrate on two standard metrics that can be easily defined and which have agreed definitions: precision (roughly "how useful the search results are")  and recall (roughly "how complete the results are").

Imagine we want to test our search engine. We have a set of documents and we will search them for the single word "testing". The image below, from Wikipedia, shows how we could calculate the metrics.


There's a lot of information in there, let's unpack some of it:
  • The square represents the documents. 
  • The solid grey circles are occurrences of the word "testing".
  • The clear grey circles are occurrences of other words.
  • The central black circle is the set of results from the search.
  • The term positive means that a word is in the results.
  • The term negative means that a word is not in the results.
  • The term true means that a word is classified correctly.
  • The term false means that a word is classified incorrectly.
Let's overlay some numbers to make it clear. We inspect the documents (outside of the SUT) and find that there are 1000 words, of which 100 are occurrences of the word "testing" (these are the solid grey circles).

We run our search using the SUT and get back 50 results (the central black circle). We inspect those results and find that 35 are the word "testing" (the true positives) and 15 are something else (the false positives - asserted to be correct, but in fact incorrect).

The pictographs at the bottom of the image give us the formulae we need: precision comes only from the set of results we can see, and in this case is 35/50 or 70%. Recall requires knowledge of the whole set of documents, and for us is 35/100 or 35%.

A striking difference, but which is better? Can one be better? These things are metrics, so can they be gamed?

Well, if the search simply returned every word in the documents its recall would be 100/100, or 100%, but precision would be very low at 100/1000, or 10%, because precision takes the negative content in the search results into account.

So can you get 100% precision? You certainly can: have the search return only those results with an extremely high confidence of being correct. Imagine only one result is returned, and it's a good one, then precision is 1/1 or 100%. Sadly, recall in this case is 1/100 or 1%.
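To make the arithmetic concrete, here is a minimal sketch of those calculations in code; the counts come straight from the worked example above, and the function names are only for illustration:

  def precision(true_positives, returned):
      # Proportion of the returned results that are correct.
      return true_positives / returned

  def recall(true_positives, relevant):
      # Proportion of the relevant items that were returned.
      return true_positives / relevant

  # The worked example: 1000 words, 100 of them "testing",
  # 50 results returned, 35 of which really are "testing".
  print(precision(35, 50))     # 0.7
  print(recall(35, 100))       # 0.35

  # Gaming the metrics: return everything ...
  print(recall(100, 100))      # 1.0, perfect recall
  print(precision(100, 1000))  # 0.1, terrible precision

  # ... or return only one sure thing.
  print(precision(1, 1))       # 1.0, perfect precision
  print(recall(1, 100))        # 0.01, terrible recall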

Which is very interesting, really, but what does it have to do with testing?

Good question; it's background for a rough and ready analogy that squirted out of a conversation at work this week, illustrating the appealing trap of simple confirmatory testing. Imagine that you run your system under test with nominated input, inspect what comes out, and check that against some acceptance criteria. Everything in the output meets the criteria. Brilliant! Job done? Or precision 100%?

Images: https://flic.kr/p/sqtWUT, Wikipedia

Saturday, March 3, 2018

Better Averse?




What is testing? In one of the sessions at the Cambridge Software Testing Clinic the other night, the whole group collaborated on a mind map on that very question.

I find it interesting how the mind can latch on to one small aspect of a thing. I attend these kinds of events to find something new, perhaps a new idea, a different perspective on something I already know, or something about the way I act. In this case, under a node labelled mindset, one of the participants proposed risk-averse. I challenged that, and counter-proposed risk-aware. You can see them both on the map on this page, centre-left, near the bottom. And that's the thing I've been coming back to since: accepting that there is a testing mindset (with all the sociological and semantic challenges that might have), is it reasonable to say that it includes risk aversion or risk awareness?

Let's start here: why did I challenge? I challenged because the interpretation that I took in the moment was that of testers saying "no, let's not ship because we know there are problems." And that's not something that I want to hear testers who work for me saying. At least not generally, because I can think of contexts in which that kind of phrase is entirely appropriate. But I won't follow that train of thought just now.

But do I even have a decent idea what risk-aversion is? I think so: according to Wikipedia's opening sentence on the topic risk-aversion "is a preference for a sure outcome over a gamble with higher or equal expected value." In my scenario, not shipping is a more certain outcome (in terms of risk that those known problems will be found in the product by customers) than shipping. But I can think of other aspects of shipping that might be more certain than in the case of not shipping. Perhaps I'll come back to them.

But then, do I have a decent idea what the person who suggested risk-aversion meant by it? In all honesty, no. I have some evidence, because they were keen on risk-awareness when I suggested it, that they overlap with me here, but I can't be sure. Even if they were with me right now, and we could spend hours talking about it, could we be certain that we view this particular conceptual space in the same way? A good question for another time.

So what did I mean by risk-aware? I meant that I want testers who work for me to be alert to the risks that might be present in any piece of work they are doing. In fact, I want them to be actively looking for those risks. And I want them to be able to characterise the risks in various ways.

For example, if they have found an issue in the product I'd like them to be able to talk to stakeholders about it: what the associated risks are, why it matters and to whom, and ideally something about the likelihood of the issue occurring and its impact if it did. I'd also like my testers to be able to talk to stakeholders in the same way about risks that they observe in the processes we're using to build our product, in the approach taken to testing, and in the results of testing. If I thought harder could I add more to this? Undoubtedly, and perhaps I will, one of these days.

While we're here, this kind of risk assessment is inherently part of testing for me. (Perceived) risk is one of the data points that should be in the mix when deciding what to test at any given time. Actually, I might elaborate on that: risk and uncertainty are two related data points that should be considered when deciding what to test. But I don't really want to open up another front in this discussion, so see Not Sure About Uncertainty for further thoughts on that.

Would it be fair to say that a tester engaging in risk-based testing is risk-averse to some extent or another? Try this: through their testing they are trying to obtain the surest outcome (in terms of understanding of the product) for the stakeholders by exploring those areas of the product that are most risky. So, well, yes, that would seem to satisfy the definition from Wikipedia wouldn't it?

Permission to smirk granted. You are observing me wondering whether I am arguing that a tester who is risk-aware, and uses that risk-awareness to engage in risk-based testing, must necessarily be risk-averse. And now I'm also thinking that perhaps this is possible because the things at risk might be different, but I time-boxed this post (to limit the risk of not getting to my other commitments, naturally) and I've run out of time.

So here's the thing that I enjoy so much: one thought blown up into a big claggy mess of conceptual strands, each of which themselves could lead to other interesting places, but with the knowledge that following one means that I won't be following another, with an end goal that can't be easily characterised, alert for synergy, overlap, contradiction, ambiguity and other relationships, and which might simply reflect my personal biases, with the intellectual challenge of disentanglement and value-seeking, all limited by time.

Hmm. What is testing?
Image: Cambridge Software Testing Clinic

Saturday, February 24, 2018

Transforming Theory and Practice




When Sneha Bhat asked if I'd present with her at CEWT #5 the talk we produced was Theoreticus Prime vs Praktikertron. In this essay we've tidied up the notes we wrote in preparation and included a few of the sketches we made when we were developing our model. The title comes from the Transformers we gave the participants at CEWT to explore in an attempt to illustrate different kinds of theory being discovered and shared.

CEWT #5 asked this question: theory over practice or practice over theory? It's an age-old conundrum, represented in popular culture by memes like these that you would have seen as you avoid both theory and practice by grazing on social media when you should be working:
In theory, there is no difference between theory and practice. But, in practice, there is. (Wiki)
Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. In our lab, theory and practice are combined: nothing works and no one knows why. (Crazy Proverbs)
In the way that all communities look for another to kick against — an outgroup to their ingroup — it’s not hard to find instances of theorists saying that practitioners don’t know why they’re doing what they do, and practitioners saying theorists couldn’t do it if they had to. But let’s not be drawn into this kind of bickering. Instead, let’s step back and try to draw out some distinction between these terms, between theory and practice.
A theory seeks to explain observations. In science it tends to be interpreted as being backed up by a weight of evidence, but in casual conversation this isn’t the case.

Practice also has a couple of primary senses, but both of them are strongly about doing, about activity: either repeatedly exercising a skill, or applying some idea.
So theory is some kind of thing, a thing that is produced, whereas practice is an activity, perhaps an engine for production. That already suggests an intriguing possible relationship, one where practice might drive the generation of theory. But how else might practice relate to theory? You don’t have to look hard to find suggestions and we’ve picked out just three that piqued our interest when we were researching our CEWT talk.
Pete Walen, a tester, writing in Considering Conferences: A Reflection, argues for a causal relationship, that theory only has value if it changes practice.

W. Edwards Deming, in The New Economics, wants theory to invoke practices that will confirm it or contradict it, and either way, increase the sum total of theory.

Steve Klabnik, a well-known thinker and contributor to the Ruby and Rust communities, thinks that the key is finding people who can explain the value of theory to practitioners and get practitioners to point out its flaws to the theorists.
Our intuition is that there’s truth in all of these perspectives, and we want to capture it in a model.  Assuming that theorists and practitioners can be convinced of the value of each other’s contributions, the primary aspects are:
  • Theory should guide practice.
  • Practice should help to refine theory.
And if that looks like a loop to you, it does to us too. But where should we enter it? Jack Cohen and Graham Medley provide a suggestion in their book, Stop Working and Start Thinking:
This cannot be said too often or emphasised too much. Ignorance, recognised, is the most valuable starting place
So we imagine some kind of process for accomplishing a goal which goes like this:
  • Do I have theory that gives me what I need?
  • If yes, then stop.
  • If no, then practice to generate theory.
  • Repeat.
And that seems interesting, maybe even plausible, but perhaps lacking in some important respects. For example, we’ve talked about both theorists and practitioners but, while the latter are represented in this pseudocode, the former are not. Where might they fit?

Our proposal is that they are present. Our proposal is that what you might call theorists, in fact, are practitioners. We’ve said that theory is a thing and practice is an activity. Theorists think about things. Thinking about things is an activity. Ergo, theorists are practitioners! They might not be working with tangible objects like a plumber would, but then perhaps neither are software testers.


To us, the model looks less naive now. But there’s another wrinkle: practice can generate all manner of data, but we guess that only some of it will become theory. To accommodate that, we think of theory as something that doesn’t necessarily have explanatory power but is simply the data that we care to keep track of.

The loop now makes more intuitive sense. Here it is again, with a theory/data distinction:
  • Do I have theory that gives me what I need?
  • If yes, then stop.
  • If no, then practice to generate data.
  • Repeat.
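
Read as code, that loop might look something like the sketch below; the function names and the dictionary representation of theory are purely illustrative, not part of the model itself:

  def pursue(goal, theory, practise, worth_keeping):
      # Do I have theory that gives me what I need? If yes, stop.
      while goal not in theory:
          # If no, then practise to generate data ...
          data = practise(goal, theory)
          # ... and the data we care to keep track of becomes theory.
          kept = [d for d in data if worth_keeping(d)]
          if kept:
              theory[goal] = kept
      return theory[goal]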

When starting a project we wonder whether we have data that answers whatever question the project is set up to answer. If not, then we set out to generate that data. This generation may be by manipulating the data we have (traditionally the realm of theorists) or manipulating some external system (traditionally practice).

It may be that both types of practice are required. Whichever it is, the theory that we have to start with, and ideas about how to get it, drive whatever practices are carried out. The data ultimately underpins everything. As Rikard Edgren warns in the Little Book of Testing Wisdom:
Software is messy. That's why empirical evidence is so valuable.
Let’s explore how we think this model suggests explanations for some phenomena that we see around testing practice and theory. Take a very simple case where a tester is exploring an application which takes letters as input. Perhaps she finds that:
  • entering data item A results in output X, B results in output Y, and C is disallowed,
  • the text box in the UI restricts entry to one letter at a time but there is an API through which bulk entry is possible.
The tester has asked a question — “what can this application do?” — and found answers, or data. The data that she tracks is her theory. The theory influences further interactions. In this case, we perceive two broad types of theory: behavioural (what the system does) and practical (how to exercise the system). In a typical project the former is likely to get reported back to the team in general, but the latter is likely to remain local, with either this tester or her peers when they share information that helps them to do their job.

We might consider these behavioural and practical flavours of theory to represent expertise and experience respectively. For us, both can usefully guide practice, even in the world where the theorist/practitioner distinction that we reject holds some sway:
It's possible to have degrees of [experience or expertise], or both, or to have neither in some area. Apart from its utility as a heuristic for remembering to look to test at different levels, it serves to remind me that the people on any project have different combinations of each and their mental models reflect that.
The manager knows there's a hammer that can be applied to the current problem: just how hard can it be? The implementer knows what to hit, and where, and just how hard it can be.


As testers we’ll naturally be alert to edge cases, the less usual scenarios. A typical edge would be the initial starting condition for a system. Let’s look at that now, and see how the model we’re building might handle it.

Imagine a tester joining a project that’s working on a product she has never heard of, in a market she didn’t know existed, for users that she has no knowledge of nor immediate empathy for. She has no expertise in this particular system but she has experience of systems in general. Perhaps she decides that her mission is to make a mental model of the product and begins to explore it, using cues from other software systems she has known as an initial comparison point.

Her theory in this situation is biased to the practical. Or, perhaps more accurately in our conception of it, the subset of her theory that she chooses to use to guide her initial practice is biased towards experience. As she works she generates more data, and the pieces she cares to keep track of become new theory for her.

Another tester dropped into the same project at the same point might have chosen a different route, even with the same mission. He might have begun by reading all of the available project documentation, stories, bug reports, and support tickets, by talking to all of the team members, by reviewing the code and so on. This is also practice, to us, and again it is based on experience rather than product or project expertise (because this tester has none), and it also generates data of which some portion becomes theory for this tester.

In this example, two testers faced with the same situation chose two different approaches. We made it sound like conscious choices, but that needn’t be the case. We’re all human, so factors such as our biases, available time or other resources, pressure from others, incorrect analyses, inattention, preferences, and familiarity with tools can all impact on the choice too.

Again, in this example, although we made the scenario the same we didn’t restrict the personal theory that the testers carried with them into the project. This can, and will, influence actions alongside any shared theory in the project. Ease of sharing is an extremely positive aspect of theory. Often, in fact, theory can be shared in much less than the time it took to generate the data which led to the theory. In a software project, we might imagine a team of testers reporting their findings to a product owner who takes from them and works on them to generate her own data and hence theory.

This PO is a practitioner. She might not practice on a physical system but her interactions with data have the same potential for experience and expertise as any interaction with a physical system:
experience: she finds Excel a convenient way to combine the reports on this project given that they are generated as csv files from session-based test management notes by the team members.
expertise: she recognises different terminology in several reports as referring to the same kind of phenomenon and combines them, and rates that issue more important because it’s clearly affecting a significant portion of the product.
Her report is theory for her, and potentially for whoever she chooses to share it with. It might go back to the team, to other teams, to a project manager, senior management and so on. Within the team, her theory will guide action. For example, the testers might be asked to run new experiments or change the way they work, or the PO might decide that it's time to ship. Outside of the team, it may be added to the set of theory for someone else, or it might be ignored or lost, or read carefully without the importance being recognised.

Having access to data is not the same as understanding or exploiting data. A practitioner will not report all of the data they collect nor all of the theory they generate to the PO in this scenario. Even if trying to be exhaustive, they will naturally provide only an abstraction of their complete knowledge, and (if our experience is anything to go by) it will tend to be biased more towards the expertise, to how the system works. In another scenario, such as a community of practice or a peer conference, they might provide a different abstraction, perhaps biased more towards experience, how to work the system. We find David Kolb’s Learning Cycles interesting here.

Also interesting to us, in Why is data hard? Helen L Kupp identifies four levels of abstraction around data: infrastructure, metrics and dimensions, exploration and tools, and insights:
Having lots of data doesn’t make it immediately valuable. And ... not only is leveraging data and metrics well critical to effective scaling, but it is also that much harder because you are “building the plane as it is flying”.
At each level she describes how practitioners operate on the data to generate value, the theory. Interestingly, she also talks about how these layers interact with and feed back to each other:


This is an appealing potential elaboration on the distinction between theorists (nearer the top) and practitioners (towards the bottom) which also emphasises data as the key thing that binds the various parties, and the different ways in which those parties act on the data or in pursuit of theory.

A distinction that we haven’t yet covered is that between tacit and explicit theory. All of the examples we’ve given involve theory that’s out in the open, known about and deliberately shared or stored, explicit. But theory can also be tacit, that is, internalised and difficult to describe. Think how easy it is to ride a bike once you’ve got it, but how hard it would be to explain all of the things you do when riding, and why.

Tacit theory can be both advantageous and problematic. An expert practitioner may add enormous value to a project by virtue of being able to divine the causes of problems, find workarounds for them, and gauge the potential impacts on important scenarios. But they may find it hard to explain how they arrived at their conclusions and asking them to try might impact on the extent to which the theory remains tacit and hence directly and intuitively accessible to them.

This kind of tacit knowledge is learned over time, and with experience, and likely also with mistakes. There is some mythology around the status of mistakes with respect to learning: Scott Berkun says “you can only learn from a mistake after you admit you’ve made it” and it’s not uncommon to hear people claim “don’t worry, you learn more from your mistakes” when you’ve messed up.

We’re not sure we agree with this. Mistakes are actions, which may generate data, some of which may become theory. At the time a practitioner takes an action and has the chance to see that data, they may not realise there was a mistake, but there is still learning to be had. Perhaps the insight here is that mistakes tend to generate how-to data (experience) because the results of a mistake are less likely to be shareable (so of less expertise benefit).

In the kinds of situations we find ourselves in day-to-day there are many actors operating in a given system, for instance testers, developers, technical authors, application specialists, Scrum masters, project managers. Each of them is using their practical skills to take some input, make some observations, create some data, and generate some theory.

There are no clear levels between which data and theory are passed. There are multiple types of theory. There is no place in which all data and theory are, or could be, shared (because a substantial portion of it is tacit). Beware, though, because as theory is generated further and further away from the point at which its underlying data was gathered, the risk of successive abstractions being too narrow increases. This can be seen in the classic reporting funnel where information gets less and less tied to reality the further up a reporting chain it goes.

Back to the original question, then. Is theory primary? Is practice primary? We think that data is primary and both theory and practice serve to guide the generation and exploitation of data:
  • Practice generates data.
  • Data makes theory.
  • Theory guides practice.
  • Practice generates data
  • ...
The required data, rather than theory or practice, is the starting point, and also the ending point. Once the required data is in hand, the loop can stop. Which, we hope you’ll agree, is a neat, ahem, theory. But what practical value does it have? For us, there are a few useful pointers:

The theorist/practitioner distinction has no real validity. We’re all on Team Practice, although we may have different levels of interest, ability, and inclination in practising in particular ways.

You can consciously make yourself more widely useful by developing skills in practices that apply across contexts and having theory around those practices which is transferable. The flip side is that depth of theory in a particular area will likely be reduced.

Don’t just pick up your favourite tool and begin hacking away. Whatever your biases, you can start any task by asking what data you need, then making a conscious choice about how to get it.
Image: Helen Kupp