Collaboration, schmollaboration

Whenever someone asks me what my research project is about, I usually open by saying we’re calling it collaborative knowledge discovery from data. That’s a nice, convenient way of putting it in a nutshell, but it immediately calls for some elaboration, especially on the meaning of the term “collaborative”. Technically, any activity that involves two or more people working together toward a common goal is collaborative, but this definition doesn’t get us very far, because in knowledge discovery you typically have at least someone who knows about the technology and someone who knows about the application domain. It’s not unheard of for one person to know about both, but still, I think it’s safe to say that collaboration is the rule rather than the exception here.

To narrow it down a bit, the kind of collaboration we’re talking about is remote and synchronous. In other words, the participants are not located in the same place, but they can all simultaneously edit whatever it is they’re collaborating on and see the effects of each other’s edits in real time. This implies that there must be some kind of online environment where the collaboration takes place; think something like Google Docs or MS Office Online, only for KDD artifacts such as datasets, algorithms and processing pipelines.

Even this is not a particularly novel idea in itself, as there are collaboration platforms already available where you can develop just these sorts of things. Therefore, in KDD-CHASER we’re upping the ante even further by focusing specifically on collaborative knowledge discovery from personal data, driven by the data owner, who cannot be assumed to have any particular technology or domain expertise. It’s a bit of a niche, which of course makes our eventual results somewhat less generalisable, but it also makes it considerably easier to spot opportunities for novel research contributions.

To me, the most interesting problems here are not necessarily associated with knowledge discovery as such but with the things that need to happen before knowledge discovery can take place. After all, from the data owner’s perspective the point of collaborating with experts is basically to have the actual discovery done by people who are better equipped for it in terms of skills and tools. This doesn’t mean, however, that the data owner’s role in the collaboration is limited to being a passive data source; on the contrary, it is the data owner’s needs that drive the entire process of collaborative KDD in the special case we’re considering.

The first problem a data owner may encounter on the way to a successful collaboration is that they don’t even know anyone they could collaborate with. The first thing the collaboration platform should do, then, is provide a matchmaking service that brings together people who have data with people who have the right sort of expertise to help turn it into something more valuable. After the matchmaking comes the really interesting part: negotiation. What kind of knowledge is the data owner interested in? What is actually achievable, given the available data and the extent to which the data owner is willing to share it with the expert? What is the expert expecting to get in compensation for their efforts? The collaborators need to find the answers to such questions among themselves, and the collaboration platform should support them in this.
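Just to make the matchmaking idea a bit more concrete, here’s a minimal sketch of how a platform might pair data owners with experts based on overlapping interests. Everything in it (the class names, the tag-based scoring, the example people) is a hypothetical illustration made up for this post, not a description of anything we’ve actually built.

```python
# Hypothetical sketch of tag-based matchmaking between data owners and experts;
# none of these names come from the actual KDD-CHASER platform.
from dataclasses import dataclass

@dataclass
class DataOwner:
    name: str
    data_tags: set   # e.g. {"heart rate", "sleep"}, describing the data they hold
    goal: str        # free-text description of what they hope to learn

@dataclass
class Expert:
    name: str
    expertise_tags: set   # e.g. {"time series", "sleep", "classification"}

def match_experts(owner: DataOwner, experts: list) -> list:
    """Rank experts by how many of the owner's data tags their expertise covers."""
    scored = [(len(owner.data_tags & e.expertise_tags), e) for e in experts]
    return [e for score, e in sorted(scored, key=lambda t: -t[0]) if score > 0]

owner = DataOwner("Alice", {"heart rate", "sleep"}, "Why do I sleep badly after late runs?")
experts = [Expert("Bob", {"sleep", "time series"}), Expert("Carol", {"nutrition"})]
print([e.name for e in match_experts(owner, experts)])  # -> ['Bob']
```

In reality the matching would of course have to weigh far more than keyword overlap, but even this toy version shows the kind of information the platform would need to collect from both sides before any pairing can happen.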

The bare minimum is to provide the collaborators with some kind of communication channel, but this is something that would be required anyway, and it’s hardly a research problem from a computing standpoint. However, there’s a lot more to negotiation than just talking, and I’m interested to see what I might do to help things along in this area. Privacy, for example, is traditionally close to my heart and something I want to address here as well, because one of the things to be determined through negotiation is how much of their data the data owner is prepared to trust their collaborators with, considering that the latter may be the KDD equivalent of someone they just matched with on Tinder.
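One way I can imagine privacy entering the negotiation is as an explicit, revisable agreement about what the expert actually gets to see. Again, the sketch below is purely illustrative: the sharing levels and field names are assumptions of mine, not features of any existing platform.

```python
# Purely illustrative sketch of a data-sharing agreement that could be
# negotiated (and later renegotiated) between a data owner and an expert.
from dataclasses import dataclass
from enum import Enum

class SharingLevel(Enum):
    SUMMARY_ONLY = 1   # aggregate statistics only, no raw records
    ANONYMISED = 2     # raw records with identifying details removed
    FULL_ACCESS = 3    # everything, presumably only once some trust has been built

@dataclass
class SharingAgreement:
    dataset: str
    level: SharingLevel
    compensation: str   # e.g. "co-authorship", "fee", "a copy of the results"

    def allows_raw_records(self) -> bool:
        return self.level in (SharingLevel.ANONYMISED, SharingLevel.FULL_ACCESS)

agreement = SharingAgreement("sleep_log", SharingLevel.SUMMARY_ONLY, "co-authorship")
print(agreement.allows_raw_records())  # False, until the owner agrees to share more
```

The point is less the code itself than the idea that the outcome of the negotiation could be recorded as a machine-readable artifact that the platform can then help enforce.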

It’s been pretty clear from the start that whatever we manage to accomplish in my current project, it’s not going to be a comprehensive solution to all the problems of collaborative KDD, even within the niche we’ve carved for ourselves. What we can realistically shoot for, though, is a model that shows us what the collaboration process looks like and gives us an understanding of where the major problems are. The software I’m building will basically be a collection of candidate solutions to a select few of these problems, and it will hopefully be something I can continue to build on when my MSCA fellowship is over.

Philosophical ruminations vol. 1

Holidays are over and I’m back from the Finnish winter wonderland to Ireland, which seems to retain an appreciable fraction of her famous forty shades of green even in the middle of winter. No sign of the pre-Christmas frenzy anymore – I’ve been working at a fairly leisurely pace for these past few weeks, enjoying the luxury of being able to take the time to have a good think about what I’m doing before I actually do it. The only deadline of immediate concern was the extended deadline of the conference I was preparing a paper for before the holidays; since I didn’t dare rely on there being an extension, I had all but finished the manuscript during the break, so there wasn’t much left for me to do to it after I got back on the clock.

Since things are not so hectic now, I thought this would be a good time for a post discussing a topic that’s not directly concerned with what’s going on in my project at the moment. When I started the blog, my intention was that one of the themes I would cover would be the philosophical dimension of knowledge discovery, and there’s a certain concept related to this that’s been on my mind quite a lot lately. The concept is known as epistemic opacity; that’s epistemic as in epistemology – the philosophical study of knowledge – and opacity as in, well, the state of not being transparent (thanks, Captain Obvious).

I ran into this concept in a paper by Paul Humphreys titled “The philosophical novelty of computer simulation methods”, published in the philosophy journal Synthese in 2009. Humphreys puts forward the argument that there are certain aspects of computer simulations and computational science that make them philosophically novel as methods of scientific enquiry, and one of these aspects is their epistemic opacity, which he defines as follows:

[…] a process is epistemically opaque relative to a cognitive agent X at time t just in case X does not know at t all of the epistemically relevant elements of the process. A process is essentially epistemically opaque to X if and only if it is impossible, given the nature of X, for X to know all of the epistemically relevant elements of the process.

That’s a bit of a mouthful, but the gist of it – as far as I understand it – is that computer simulations are opaque in the sense that there is no way for a human observer to fully understand why a given simulation behaves the way it does. This makes it impossible to verify the outcome of the simulation using means that are independent of the simulation itself; a parallel may be drawn here with mathematics, where there has been criticism of computer-generated proofs that are considered non-surveyable, meaning that they cannot be verified by a human mathematician without computational assistance.

The philosophical challenge here arises from the fact that since we have no means to double-check what the computer is telling us, we are effectively outsourcing some of our thinking to the computer. To be fair, we have been doing this for quite some time now and it seems to have worked out all right for us, but in the history of science this is a relatively new development, so I think the epistemologists can be excused for still having some suspicions. I doubt that anyone is suggesting we should go back to relying entirely on our brains (it’s not like those are infallible either), but I find that in any activity, it’s sometimes instructive to take a step back and question the things you’re taking for granted.

The algorithms used in knowledge discovery from data can also be said to be epistemically opaque, in the sense that while they quite often yield a model that works, it’s a whole different matter to understand why it works and why it makes the mistakes that it does. And they do make mistakes, even the best of them; there’s no such thing as a model that’s 100% accurate 100% of the time, unless the problem it’s supposed to solve is a very trivial one. Of course, in many cases such accuracy is not necessary for a model to be useful in practice, but there is something about this that the epistemologist in me finds unsatisfying – it feels like we’re giving up on the endeavour to figure out the underlying causal relationships in the real world and substituting the more pedestrian goal of being able to guess a reasonably accurate answer with adequate frequency, based on what is statistically likely to be correct given loads and loads of past examples.
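To make the point concrete with a toy example (scikit-learn is used here purely for convenience; nothing in it is specific to my project or to wearables): a model can score respectably on held-out data while offering essentially nothing about why any individual prediction comes out the way it does.

```python
# Toy illustration of an accurate but opaque model. The synthetic data and the
# model choice are arbitrary; the point is that accuracy tells us the model
# often guesses right, not why it guesses the way it does.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
# Usually somewhere around 0.9 on synthetic data like this: useful, but well
# short of 100%, and the hundred-odd decision trees inside offer no causal
# story about the domain the data came from.
```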

From a more practical point of view, the opacity of KDD algorithms and the uncertainty concerning the accuracy of their outputs may or may not be a problem, since some users are in a better position to deal with these issues than others. Traditionally, KDD has been a tool for experts who are well aware of its limitations and potential pitfalls, but it is now increasingly being packaged together with miniaturised sensors and other electronics to make a variety of consumer products, such as the wearable wellness devices I’m working with. The users of these products are seldom knowledge discovery experts, and even for those who are, there is little information available to help them judge whether or not to trust what the device is telling them. The net effect is to make the underlying algorithms even more opaque than they would normally be.

Now, I presume that by and large, people are aware that these gadgets are not magic and that a certain degree of skepticism concerning their outputs is therefore warranted, but it would be helpful if we could get some kind of indication of when it would be particularly good to be skeptical. I suspect it’s often the case that this information exists but that we don’t get to see it, basically because it would clutter the display with things that are not strictly necessary. Moreover, this information is lost forever when the outputs are exported, which may be an issue if they are to be used, for instance, as research data, in which case it would be rather important to know how reliable they are. I’d be quite interested in seeing a product that successfully combines access to this sort of information with the usability virtues of today’s user-friendly wearables.
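As a very rough sketch of what I mean by keeping the reliability information around at export time: the field names and numbers below are entirely made up, and I’m not aware of any device that exports its data in exactly this form.

```python
# Hypothetical export that keeps a per-reading quality indicator alongside the
# measured value, instead of silently discarding it.
import csv

readings = [
    {"timestamp": "2023-01-10T07:00:00", "heart_rate": 61, "signal_quality": 0.95},
    {"timestamp": "2023-01-10T07:01:00", "heart_rate": 58, "signal_quality": 0.40},
]

with open("export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "heart_rate", "signal_quality"])
    writer.writeheader()
    writer.writerows(readings)

# A researcher reusing this file can now choose to discard or down-weight the
# 07:01 reading, whose low signal_quality suggests the sensor had lost contact.
```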