Philosophical ruminations vol. 1

Holidays are over and I’m back from the Finnish winter wonderland to Ireland, which seems to retain an appreciable fraction of her famous forty shades of green even in the middle of winter. No sign of the pre-Christmas frenzy anymore – I’ve been working at a fairly leisurely pace for these past few weeks, enjoying the luxury of being able to take the time to have a good think about what I’m doing before I actually do it. The only deadline of immediate concern was the extended deadline of the conference for which I was preparing a paper before the holidays; since I hadn’t dared rely on there being an extension, I had all but finished the manuscript during the break, so there wasn’t much left for me to do to it after I got back on the clock.

Since things are not so hectic now, I thought this would be a good time for a post discussing a topic that’s not directly concerned with what’s going on in my project at the moment. When I started the blog, my intention was that one of the themes I would cover would be the philosophical dimension of knowledge discovery, and there’s a certain concept related to this that’s been on my mind quite a lot lately. The concept is known as epistemic opacity; that’s epistemic as in epistemology – the philosophical study of knowledge – and opacity as in, well, the state of not being transparent (thanks, Captain Obvious).

I ran into this concept in a paper by Paul Humphreys titled “The philosophical novelty of computer simulation methods”, published in the philosophy journal Synthese in 2009. Humphreys puts forward the argument that there are certain aspects of computer simulations and computational science that make them philosophically novel as methods of scientific enquiry, and one of these aspects is their epistemic opacity, which he defines as follows:

[…] a process is epistemically opaque relative to a cognitive agent X at time t just in case X does not know at t all of the epistemically relevant elements of the process. A process is essentially epistemically opaque to X if and only if it is impossible, given the nature of X, for X to know all of the epistemically relevant elements of the process.

That’s a bit of a mouthful, but the gist of it – as far as I understand it – is that computer simulations are opaque in the sense that there is no way for a human observer to fully understand why a given simulation behaves the way it does. This makes it impossible to verify the outcome of the simulation using means that are independent of the simulation itself; a parallel may be drawn here with mathematics, where there has been criticism of computer-generated proofs that are considered non-surveyable, meaning that they cannot be verified by a human mathematician without computational assistance.

The philosophical challenge here arises from the fact that since we have no means to double-check what the computer is telling us, we are effectively outsourcing some of our thinking to the computer. To be fair, we have been doing this for quite some time now and it seems to have worked out all right for us, but in the history of science this is a relatively new development, so I think the epistemologists can be excused for still having some suspicions. I doubt that anyone is suggesting we should go back to relying entirely on our brains (it’s not like those are infallible either), but I find that in any activity, it’s sometimes instructive to take a step back and question the things you’re taking for granted.

The algorithms used in knowledge discovery from data can also be said to be epistemically opaque, in the sense that while they quite often yield a model that works, it’s a whole different matter to understand why it works and why it makes the mistakes that it does. And they do make mistakes, even the best of them; there’s no such thing as a model that’s 100% accurate 100% of the time, unless the problem it’s supposed to solve is a very trivial one. Of course, in many cases such accuracy is not necessary for a model to be useful in practice, but there is something about this that the epistemologist in me finds unsatisfying – it feels like we’re giving up on the endeavour to figure out the underlying causal relationships in the real world and substituting the more pedestrian goal of being able to guess a reasonably accurate answer with adequate frequency, based on what is statistically likely to be correct given loads and loads of past examples.

From a more practical point of view, the opacity of KDD algorithms and the uncertainty concerning the accuracy of their outputs may or may not be a problem, since some users are in a better position to deal with these issues than others. Traditionally, KDD has been a tool for experts who are well aware of its limitations and potential pitfalls, but it is now increasingly being packaged together with miniaturised sensors and other electronics to make a variety of consumer products, such as the wearable wellness devices I’m working with. The users of these products are seldom knowledge discovery experts, and even for those who are, there is little information available to help them judge whether or not to trust what the device is telling them. The net effect is to make the underlying algorithms even more opaque than they would normally be.

Now, I presume that by and large, people are aware that these gadgets are not magic and that a certain degree of skepticism concerning their outputs is therefore warranted, but it would be helpful if we could get some kind of indication of when it would be particularly good to be skeptical. I suspect that this information often exists, but we don’t get to see it, basically because it would clutter the display with things that are not strictly necessary. Moreover, this information is lost forever when the outputs are exported, which may be an issue if they are to be used, for instance, as research data, in which case it would be rather important to know how reliable they are. I’d be quite interested in seeing a product that successfully combines access to this sort of information with the usability virtues of today’s user-friendly wearables.

Dear Santa

Now that I’ve managed to clear away all of the stressful and/or boring stuff that was keeping me busy, time to do something fun: Christmas shopping! After the break my project is going to be almost halfway through, and although it will be a good while yet before I’m ready to start conducting user tests, it’s time to start getting serious about recruiting participants. After all, the tests are supposed to be about analysing the participants’ data, so they can’t just walk in at their convenience – I need them to spend some time collecting data first, and to do that, they’ll need something to collect the data with.

Our initial idea was to recruit people who are already using a sleep monitor of some kind, and I’m sure we’ll be able to find at least a few of those, but naturally we’ll have a bigger pool of candidates if we have a few devices available to loan to people who don’t have one of their own. Also, it’s obviously useful for me to play with these devices a bit so I can get a better idea of what sort of data they generate and what’s the best way to export it if I want to use it for my research (which I do). Besides, I’m hardly going to spend my entire expense budget on travel even if I go out of my way to pick the most remote conferences I can find to submit papers to.

So I didn’t need to worry too much about what I could afford – one of the many great things about the MSCA fellowship – but that doesn’t mean that the choice of what to buy was straightforward, because the range of consumer products capable of tracking sleep is, frankly, a little bewildering. Some devices you wear on your body, some you place in your bed and some at the bedside, and although I soon decided to narrow down my list of options by focusing on wearables, that still left me with more than enough variety to cope with. Some of these gadgets you wear on your wrist, while others go on your finger like a ring, and the wrist-worn ones range from basic fitness bracelets to high-end smartwatches that will probably make you your protein smoothie and launder your sports gear for you if you know how to use them.

One thing that made the decision quite a lot easier for me is that the manufacturers of fitness bracelets now helpfully include all of their sleep tracking functionality in models that are near the low end of the price spectrum, and since I’m only interested in sleep data, there was no need to ponder if I should go with the inexpensive ones or invest in bigger guns. Also, I had a preference for products that don’t make you jump through hoops if you want to export your data in a CSV file or similar, so I looked at the documentation for each of my candidates and if I couldn’t find a straight answer on how to do that, I moved on. In the end I settled on three different ones: the Fitbit Alta HR, the Withings Steel, and the Oura Ring.

What I particularly like about this trio is that each of these models represents a distinct style of design: the Fitbit is a modern bracelet-style gadget, whereas the Withings looks more like a classic analog wrist watch, and the Oura is, well, a ring. I can thus, to a certain extent, cater for my study participants’ individual stylistic preferences. For example, I’m rather partial toward analog watches myself, so I’d imagine that for someone like me the design of the Withings would have a lot of appeal.

Today’s my last day at work before the Christmas break, and things are wrapping up (no pun intended) very nicely. The orders for the sleep trackers went out last week, this morning I submitted the last of my (rather badly overdue) ethics deliverables to the European Commission, and just minutes ago I came back from my last performance with the DCU Campus Choir for this year. The only thing that may impinge on my rest and relaxation over the next couple of weeks is that there’s a conference deadline coming up immediately after my vacation and I’m quite eager to submit, but I shouldn’t need to worry about that until after New Year. Happy holidays, everyone!

Busy times

With the end-of-year holidays approaching, things tend to get busy in a lot of places, not just in Santa’s workshop. My life in Ireland is no exception: there are five major work-related (or at least university-related) things that I’ve been trying my best to juggle through November, with varying success. Many of these will culminate over the next two weeks or so, after which I’m hoping it will be comparatively smooth sailing till I leave for my well-deserved Christmas break in Finland. The blog I’m not even counting among the five and I’ve been pretty much neglecting it, so this post is rather overdue, and also a welcome break from all of the more pressing stuff that I should really be working on right now.

One area where I’ve had my hands full is data protection, where it seems that whenever a document is finished, there’s always another one to be prepared and submitted for evaluation. Getting a green light from the Research Ethics Committee was a big step forward, but there’s now one more hurdle left to overcome in the form of a Data Protection Impact Assessment. I’m very much learning (and making up) all of this as I go along, and the learning curve has proved a rather more slippery climb than I expected, but I’m getting there. In fact, I’m apparently one of the first to go through this process around here, so I guess I’m not the only one trying to learn how it works. I hope this means that things will be easier for those who come after me.

Meanwhile, I’ve been preparing to give my very first lecture here at DCU – thankfully, just one guest lecture and not a whole course, but even that is quite enough to rack my nerves. It is a little strange that this should be the case, even after all the public speaking I’ve had to do during my fifteen-plus years in research, but the fact of the matter is that it does still feel like a bit of an ordeal every time. Of course it doesn’t help that I’m in a new environment now, and also I’ll be speaking to undergraduate students, which is rather different from giving a presentation at a conference to other researchers. Still, I’m not entirely unfamiliar with this type of audience, and I can recycle some of the lecture materials I created and used in Oulu, so I think I’m going to be all right.

Speaking of conferences, I’m serving on the programme committee of the International Conference on Health Informatics for the second year running and the manuscript reviewing period is currently ongoing, so that’s another thing that’s claimed a sizable chunk of my time recently. Somewhere among all of this I’m somehow managing to fit in a bit of actual research as well, although it’s nowhere near as much as I’d like, but I guess we’ve all been there. The software platform is taking shape towards a minimum viable product of sorts, and I have a couple of ideas for papers I want to write in the near future, so there’s a clear sense of moving forward despite all the other stuff going on.

So what’s the fifth thing, you ask? Well, I’ve rekindled my relationship with choral singing by joining the DCU Campus Choir, having not sung in a proper choir since school. Despite the 20-year gap (plus a bit), I haven’t had much trouble getting into it again: I can still read music, I can still hit the bass notes, and I don’t have all that much to occupy myself with in the evenings and at weekends, so I have plenty of time to learn my parts (although I’m not sure how happy my neighbours are about it). The material we’re doing is nice and varied, and the level of ambition is certainly sufficient, as it seems like we’re constantly running out of rehearsal time before one performance or another. Our next concert will be Carols by Candlelight at DCU’s All Hallows campus on the evening of Monday the 10th of December, so anyone reading this who’s in town that day is very warmly welcome to listen!

Sleepytime

I recently obtained approval for my research from the DCU Research Ethics Committee, so I’m now officially good to go. This might seem like a rather late time to be getting the go-ahead, considering that I’ve been doing the research since February, but so far the work has been all about laying the foundations of the collaborative knowledge discovery software platform (for which I’m going to have to come up with a catchy name one of these days). This part of the project doesn’t involve any human participants or real-world personal data, so I’ve been able to proceed with it without having to concern myself with ethical issues.

As a matter of fact, if it were entirely up to me, the ethics application could have waited until even later, since it will be quite a while still before the platform is ready to be exposed to contact with reality. However, the Marie Curie fellowship came with T&Cs that call for ethics matters to be sorted out within a certain time frame, so that’s what I’ve had to roll with. I’d never actually had to put together an application like this before, so perhaps it was about time, and presumably it won’t hurt that some important decisions concerning what’s going to happen during the remainder of the project have now been made.

One of the big decisions I’d been putting off, but couldn’t anymore, was the nature of the scenario that I will use to demonstrate that the software platform is actually useful for the purpose for which it’s intended. This will be pretty much the last thing that happens in the project, and before that the software will have been tested in various other ways using, for example, open or synthetic data, but eventually it will be necessary to find some volunteers and have them try out the software so I can get some evidence on the workability of the software in a reasonable approximation of a real-world situation. It’s hardly the most controversial study ever, but it’s still research on human subjects and there will be processing of personal data involved, so things like research ethics and the GDPR come into play here and need to be duly addressed.

What I particularly needed a more precise idea about was the data that would be processed using the software platform. In the project proposal I said that this would be lifelogging data, but that can mean quite a few different things, so I needed to narrow it down to something specific. Of course it wouldn’t make sense to develop a platform for analysing just one specific kind of data, so as far as the design and implementation of the software is concerned, I have to pretend that the data could be anything. However, the only way I can realistically expect to be able to carry out a meaningful user test where the users actually bring their own data is by controlling the type of data they can bring.

There were a few criteria guiding the choice of the type of data to focus on. For one thing, the data had to be something that I knew to be already available at some sources accessible to me, so that I could run some experiments on my own before inflicting the software on others. Another consideration was the availability of in-house expertise at the Insight Centre: I’ve never done any serious data mining myself, having always looked at things from more of a software engineering perspective, so it was important that there would be someone close by who knows about the sort of data I intend to process and can help me ensure that the platform I’m building has the right tools for the job.

When I discussed this issue with my supervisor, he suggested sleep data – I’m guessing not least because it’s a personal interest of his, but it does certainly satisfy the above two criteria. Furthermore, it also satisfies a third one, which is no less important: there are many different devices on the market that are capable of tracking your sleep, and these are popular enough that it shouldn’t be a hopeless task to find a decent number of users to participate in testing the software. The concept of lifelogging is often associated with wearable cameras such as the Microsoft SenseCam, but these are much more of a niche product, making photographic data a not very attractive option – which it would have been anyway because of the privacy implications of various things that may be captured in said photographs, so we kind of killed two birds with one stone there.

Capturing and analysing sleep data is something of a hot topic right now, so in terms of getting visibility for my research, I guess it won’t hurt to hop on the bandwagon, even though I’m not aiming to develop any new analysis techniques as such. Interestingly, the current technology leader in wearable sleep trackers hails from Oulu, Finland, the city where I lived and worked before joining Insight and moving to Dublin. There’s been quite a lot of media buzz around this gadget recently, from Prince Harry having been spotted wearing one on his Australian tour to Michael Dell announcing he’s decided to invest in the company that makes them. I haven’t personally contributed to the R&D behind the product in any way, but I feel a certain amount of hometown pride all the same – Nokia phones may have crashed and burned, but Oulu has bounced back and is probably a lot better off in the long run, not depending so heavily on a single employer anymore.

A Solid foundation for social apps?

Tim Berners-Lee recently posted an open letter on the web, announcing to the wider online community the launch of Solid, a new technology platform that he and his team at MIT have been working on for the past few years. Like a lot of people these days, he’s not too happy about the way our personal data is being controlled and exploited by providers of online services, but when the father of the web is telling you how it’s not gone the way he intended, you may want to prick up your ears even if you personally have no problem with the way things are. Not only that, but when he says he’s come up with something that we can use to set things right, it’s probably worth checking out.

We’ve all seen the headlines that result when a company with a business model based on aggregating and monetising personal data gets negligent or unscrupulous with the data in its possession, but these incidents are really just symptoms of a more fundamental issue concerning the architecture of basically every popular online social application out there. Even if we imagine a perfect world of ideal application providers that are completely open and honest about how they use your data and never suffer any security breaches, the fact remains that they, not you, control the data you’ve given them. You still own it, yes, but they control it.

Why is this an important distinction? The answer has to do with the coupling of your data with the specific services you’re using: you can’t have one without the other. As a result, your data is broken up into pieces that are kept in separate bins, one for each service, even when it would be really helpful to have it all in the same place. If you want to use several services that all use the same data, you have to upload it to each one separately, and that’s assuming that you have or can get the data in a reusable format, which isn’t always the case. It would make a lot more sense to have just a single copy of the data and permit the services to access that – within privacy parameters that you have complete control of – and it would be even better if you could move your data to a different location without breaking all those services that depend on it.

Sound good? Well, the people behind Solid apparently want you to be able to do just that. Their proposed solution is based on decoupling data from applications and storing it in units called PODs (short for personal online data store). Applications built on the Solid platform can access the data in your POD if you give them permission to do so, but they don’t control the data, so they can’t impose any artificial restrictions on how you use, combine and reuse data from different sources. The end-users of Solid are thus empowered to make the best possible use of their data while retaining full control of what data they disclose and to whom, which is very much what I’m aiming for in my own research; I can easily see collaborative knowledge discovery as an app implemented on Solid or some similar platform.

So that’s the theory, anyway. What about reality? I can’t claim to have examined the platform in great depth, but at least on the surface, there are a number of things that I like about it. It’s built on established W3C specifications in what looks like a rather elegant way where linked data technologies are used to identify data resources and to represent semantic links between them – for example, between a photo published by one user and a comment on the photo posted by another. Besides your data, your POD also holds the identity you use to access various resources, somewhat like you can now use your Google or Facebook credentials to log in to other vendors’ services, but without the dependence on a specific service to authenticate your identity. Of course, you still need to get your Solid POD from somewhere, but you’re free to choose whichever provider suits you best, or even to set up your own Solid server if you have the motivation and the means.
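To give a slightly more concrete flavour of that linked-data idea, here’s a minimal sketch in Python using the rdflib library of how a photo sitting in one user’s POD and a comment sitting in another’s could be tied together with semantic links. The POD URLs, the WebID-style identifiers and my choice of schema.org and FOAF terms are illustrative assumptions on my part, not something prescribed by Solid itself.

```python
# A toy illustration (not actual Solid code) of linked data spanning two PODs:
# a photo stored in Alice's POD and a comment on it stored in Bob's POD.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

SCHEMA = Namespace("http://schema.org/")

g = Graph()
g.bind("schema", SCHEMA)
g.bind("foaf", FOAF)

# Each user is identified by a WebID-style URI hosted in their own POD.
alice = URIRef("https://alice.example.org/profile/card#me")
bob = URIRef("https://bob.example.org/profile/card#me")
g.add((alice, FOAF.name, Literal("Alice")))
g.add((bob, FOAF.name, Literal("Bob")))

# A photo resource living in Alice's POD...
photo = URIRef("https://alice.example.org/photos/sunset.jpg")
g.add((photo, RDF.type, SCHEMA.Photograph))
g.add((photo, SCHEMA.author, alice))

# ...and a comment living in Bob's POD that links back to Alice's photo.
comment = URIRef("https://bob.example.org/comments/42")
g.add((comment, RDF.type, SCHEMA.Comment))
g.add((comment, SCHEMA.author, bob))
g.add((comment, SCHEMA.about, photo))
g.add((comment, SCHEMA.text, Literal("Lovely colours!")))

print(g.serialize(format="turtle"))
```

The point is that neither resource has to leave its owner’s POD; an application simply follows the links between them, and the permission to read or write each resource stays with its owner.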

Whether Solid will catch on as a platform for a new class of social web apps is not just a matter of whether it is technically up to the challenge, though. The point of social media is very much to have everyone in your social network using the same applications, so the early adopters won’t have much of an impact if their friends decide that it’s just so much more convenient to keep using the apps where they already have all their connections and content rather than to switch platforms and build everything up all over again – which, of course, is precisely the sort of thinking the providers of those apps are counting on and actively reinforcing. People like me may give Solid a go out of sheer curiosity, but I suspect that the majority can’t be bothered unless there are Solid apps available that let them do things they really want to do but haven’t been able to before. Taking control of your own data is a noble principle for sure, but is it enough to attract a critical mass of users?

Then there’s the question of how the Solid ecosystem will work from a business perspective. The supply of interesting applications is going to be quite limited unless there’s money to be made by developing them, and presumably the revenue-generation models of centralised social apps can’t simply be dropped into a decentralised environment such as Solid without modification. We pretty much take it for granted now that we can “pay” for certain kinds of services through the act of using them and generating data for the service provider to use as raw material for services that the provider’s actual customers will pay good money for, but would – and should – this work if the provider could no longer control the data? On the other hand, would we be willing to pay for these services in cash rather than data, now that we’ve grown so used to getting them for “free”? Then again, there was a time when it was not at all clear how some of today’s multi-billion-dollar companies were ever going to turn a profit, so maybe we just need the right sort of minds to take an interest for these things to get figured out.

It’s also worth noting that Solid is by no means the only project aiming to make the web less centralised and more collaborative. There is a substantial community of researchers and developers working on solutions to various problems in this area, as evidenced by the fact that Solid is but one of dozens of projects showcased at the recent DWeb Summit in San Francisco, so it may well turn out that even if Solid itself won’t take off, some other similar thing will. I won’t be betting any money on any of the contenders just yet, but I probably will get myself a Solid POD to play with so I can get a better idea of what you can do with it.

Tips for MSCA hopefuls

This year’s call for MSCA Individual Fellowship applications closed recently, with a grand total of 9,830 applications received – apparently a record number for MSCA and in fact for Horizon 2020 in general. Good luck to everyone who submitted! Soon after I started my own fellowship at the Insight Centre, I was invited by one of the people who helped me prepare my proposal to participate in a seminar on MSCA and speak on my experiences as a successful candidate. The presentation I gave there was quite well received, so I thought I’d share my little tips and tricks in the blog as well, even though the timing arguably isn’t the greatest, given that it won’t be until sometime in the spring that the next call opens.

First of all, if you’re considering applying but having some doubts, I heartily recommend that you go through with it. Although technically MSCA fellowships are H2020 projects, which may sound a bit frightening, the proposal process is actually quite lightweight, with the length of the research plan limited to ten pages and the budget being a simple function of which country you’re going to, how many months you will spend there and whether you have a family. The same goes for how the projects are managed, so you don’t need to worry that you’ll end up spending an inordinate portion of your precious research time cranking out deliverables instead of generating results. So, without further ado, here are my top 5 tips for would-be MSCA fellows:

1. Find the right host

I’ve already mentioned in a previous post that it boosts your chances considerably if the strengths of your prospective host complement yours. It certainly doesn’t hurt if there’s someone at the host institution – ideally, your prospective supervisor – that you already know and have developed a rapport with, but you shouldn’t get too hung up on that particular point; what really matters from the reviewers’ point of view is whether the place where you are proposing to carry out your project is the best possible environment for that project. Consider what the host can offer you in terms of things such as training, research infrastructure and potential collaborators, and make sure that you have a persuasive argument that comes across in your proposal. Also, keep in mind that there is expected to be two-way knowledge transfer between the researcher and the host, so it’s not just about what you can get from the host – it’s also about what you can bring to the host.

2. Get all the help you can

The most important part of the proposal is the actual research to be done – objectives, methodology, etc. – and that’s all up to you (plus, to a certain extent, your supervisor of course, but they’re likely to have quite a few things on their plate besides this). However, for everything else, don’t hesitate to take advantage of any support that the host institution can offer you in preparing the proposal. The odds are that there are people there who have done this sort of thing before and know what reviewers look for in a proposal in terms of facts and figures, hosting arrangements, available research services and so forth. They may also have access to external experts and offer to send your proposal to them for feedback, and I think it goes without saying that you should accept such an offer. What I found particularly useful were the ideas on how to communicate my research results to non-academic audiences, since my first instinct (and I’m pretty sure I’m not alone in this) is to just write papers for journals and conferences and let others worry about public relations, and this cost me some crucial points when I applied for the first time.

3. It’s all about you

This is another thing I touched upon in that earlier post: MSCA fellowships are unusual, if not unique, in that their impact is measured in terms of the career development of the fellow as a European researcher. Therefore you should consider starting not with the question “What do I want to study?” but with “What do I want to be?” The answer won’t give you your research topic, but it will affect the way you go about choosing one and developing a plan around it. Do you want to work in academia or industry? In what sort of role? Or maybe you’re interested in starting your own company? Whatever your target is, state it clearly in the proposal and make sure that everything else in the proposal – research activities, training, etc. – is aligned with that target. If you’re not quite sure what you want and would prefer to keep your options open, pick a career goal anyway and pretend that you do know that’s where you’re headed; there’s nothing wrong with changing your mind later, but it doesn’t look good if you don’t seem to have any sort of long-term vision of your career. Of course, if you’ve come up with a work plan that can support multiple career paths equally well, it shouldn’t hurt if you point this out in the proposal.

4. Give details generously

This is really a more general formulation of the previous point: it can be tempting to keep things a bit vague, but every bit of vagueness will make your proposal seem that much less convincing – and remember, the bar is high and competition fierce, so every little bit counts. This goes for your career objectives, but other things as well; for example, when describing how you plan to disseminate the results of your research, try to come up with tentative titles for the papers you’re going to write and to identify specific journals and conferences where you will aim to publish those papers. If you can name some likely co-authors, even better, and it’s also good to consider how you will measure the impact of your dissemination and communication activities (e.g., number of paper citations, number of people reached). Likewise, in your implementation plan, provide as much detail as you can (without breaking the page limit) on things such as work breakdown, timetables, deliverables and milestones; in the real world, you won’t be expected to follow that plan to the letter, but you do need to demonstrate in the proposal that there’s a clear path from where you are now to where you want to be at the end of the project.

5. Focus your efforts right

Having only ten pages to explain your research plan in full detail is a blessing but also a curse, because rationing out those ten pages between the things you want to say may prove quite a challenge. To get an idea of where you should be concentrating your best efforts, always keep in mind the three evaluation criteria and their relative weights in your overall score: excellence counts for 50%, impact for 30% and implementation for 20%, so it’s a good rule of thumb to allot 5, 3 and 2 pages to the corresponding proposal sections, respectively. However, if you’re working on a revision of a proposal that didn’t get funding the first time around, you also need to consider your previous evaluation scores, because the law of diminishing returns applies as your score for a given criterion approaches 5. So, if you did very well on excellence but not quite as well on the other two criteria, you’re likely to get a bigger increase in your total score for the same amount of effort if you focus on impact and implementation, even though excellence weighs as much as the other two combined. You’ll definitely want to improve any criterion score lower than 4, and the verbal feedback in the evaluation report should give you a pretty good idea of how you can do that.

So that’s it! I hope you found these tips useful and will come back to them when it’s time to start preparing an application for the next MSCA IF call.

First blood

Time to look at the first results from my project! Well, not quite – the first results are in a literature survey I did immediately after starting the project and made into a journal manuscript. I’m currently waiting for the first round of reviews to come in, but in the meantime I’ve been busy developing my ideas about collaborative knowledge discovery into something a bit more concrete. In particular, I’ve been thinking about one of the potential obstacles to effective collaboration from the data owner’s perspective: privacy.

In the aftermath of the much publicised Facebook–Cambridge Analytica scandal, one would at least hope that people are becoming more wary about sharing their personal data online. On the other hand, with the General Data Protection Regulation in full effect since 25 May, a huge number of people are now covered by a piece of legislation that grants them an extensive set of personal data control rights and has the power to hurt even really big players (like Facebook) if they don’t respect those rights. Of course, it’s still up to the people to actually exercise their rights, which may or may not happen, but after all the GDPR news, emails and “we use cookies” notices on websites, they should be at least vaguely aware that they have them.

The increased awareness of threats to privacy online and the assertion of individuals, rather than corporations, as the owners of their personal data are welcome developments, and I like to think that what I’m trying to accomplish is well aligned with these themes. After all, the collaborative knowledge discovery platform I’m building is intended to empower individual data owners: to help them extract knowledge from their own data for their own benefit. This does not make the privacy issue a trivial one, however – in fact, I wouldn’t be surprised if it turned out that people are more uneasy about sharing a small portion of their data with an individual analyst focusing on their case specifically than about using an online service that grabs and mines all the data it can but does so in a completely impersonal manner. The platform will need to address this issue somehow lest it end up defeating its own purpose.

The angle from which I decided to approach the problem involves using a domain ontology and a semantic reasoner, which are technologies that I had been interested in for quite some time but hadn’t really done anything with. As I was doing the literature survey, I became increasingly convinced that an underlying ontology would be one of the key building blocks of the new platform, but it was also clear to me that I would need to start by modelling some individual aspect of collaboration as a proof of concept, so that I would fail fast if it came to that. If I started working top-down to produce a comprehensive representation of the entire domain, in the worst case I might take ages to discover nothing but that it wasn’t a very viable approach after all.

All this came together somewhat serendipitously when I found out that the 2nd International Workshop on Personal Analytics and Privacy (PAP 2018), held in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2018) in Dublin, had an open call for papers. The submission deadline was coming up in a month – enough time to put together some tentative results, though nothing hugely impressive – and coincided rather nicely with the date when I was planning to fly to Finland for my summer holidays. In about two weeks I had the first version of the manuscript ready, with another two left over for revisions.

The ontology I designed is based on the idea of a data owner and a data analyst (or possibly any number of either) using the collaborative knowledge discovery platform to negotiate the terms of their collaboration. Each uses the platform to specify requirements, but from opposing perspectives: the data analyst specifies analysis tasks, which require certain data items as input, while the data owner specifies privacy constraints, which prevent certain data items from being released to the data analyst. The data owners, data analysts, data items, analysis tasks and privacy constraints are all registered as individuals in the ontology and linked with one another such that a reasoner is able to use this information to detect conflicts, that is, situations where a data item is required for a data analysis task but not released by the data owner.

To resolve such conflicts, the data owner and the data analyst may, for example, agree that the analyst receives a version of the dataset from which the most sensitive information has been removed. Removing information reduces the utility of the data, but does not necessarily make it completely useless; finding a balance where the data owner’s privacy preferences are satisfied while the data analyst still gets enough material to work with is the essence of the negotiation process. The ontology is meant to support this process by not just pointing out conflicts, but by suggesting possible resolutions based on recorded knowledge about the utility effects of different methods of transforming data to make it less sensitive.
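The actual OWL ontology doesn’t lend itself very well to a blog post, but here’s a rough plain-Python sketch of the logic described above: detecting conflicts between the data items an analysis task requires and those the data owner withholds, and suggesting the least utility-reducing transformation as a possible resolution. The class names, the example task and the utility figures are all made up for illustration; in the real thing, this knowledge lives in the ontology and the inference is left to a semantic reasoner.

```python
# A plain-Python illustration of the conflict detection and resolution logic
# described above. The real implementation is an OWL ontology plus a reasoner;
# the names, the example task and the "utility costs" here are invented.
from dataclasses import dataclass, field

@dataclass
class AnalysisTask:
    name: str
    required_items: set          # data items the analyst needs as input

@dataclass
class DataOwner:
    name: str
    withheld_items: set          # items the owner's privacy constraints block
    # transformations the owner would accept instead of withholding an item,
    # with a rough utility cost (0 = no loss, 1 = item effectively useless)
    acceptable_transforms: dict = field(default_factory=dict)

def detect_conflicts(task, owner):
    """A data item that is required but not released is a conflict."""
    return task.required_items & owner.withheld_items

def suggest_resolutions(task, owner):
    """For each conflict, suggest the least utility-reducing transformation."""
    suggestions = {}
    for item in detect_conflicts(task, owner):
        options = owner.acceptable_transforms.get(item, [])
        suggestions[item] = min(options, key=lambda t: t[1]) if options else None
    return suggestions

# Toy example: a sleep-pattern analysis that also wants location data.
task = AnalysisTask("sleep_pattern_analysis",
                    {"sleep_stages", "heart_rate", "location"})
owner = DataOwner("alice",
                  withheld_items={"location"},
                  acceptable_transforms={
                      "location": [("coarsen to city level", 0.3),
                                   ("remove entirely", 1.0)]})

print(detect_conflicts(task, owner))     # {'location'}
print(suggest_resolutions(task, owner))  # {'location': ('coarsen to city level', 0.3)}
```

Nothing sophisticated, but it hopefully makes the negotiation idea a bit more tangible.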

For the PAP workshop paper, I only had time to design the logic of conflict detection in any detail, and there also was no time to test the ontology in a real-world scenario or even a plausible approximation of one. It therefore hardly seems unfair that although the paper was accepted for a short oral presentation at the workshop, it was not accepted for inclusion in the post-proceedings. Obviously it would have been nicer to get a proper publication out of it, but I decided to go ahead and give the presentation anyway – ECML-PKDD is the sort of conference I might have gone to even if I didn’t have anything to present, and since the venue is a 25-minute walk away from my house, the only cost was the registration fee, which I could easily afford from the rather generous allowance for sundry expenses that came with the MSCA fellowship.

Croke Park may seem like an unlikely place to have a conference, but it is in fact a conference centre as well as a stadium, and seems to work perfectly well as a venue for an academic event – meeting spaces, catering and all. Besides Croke Park, we had Mansion House for the welcome reception and Taylor’s Three Rock for the conference banquet, so can’t complain about the locations. The regular programme was quite heavy on algorithms, which isn’t really my number one area of interest, but I did manage to catch some interesting application-oriented papers and software demos. What I enjoyed the most, however, were the keynote talks by Corinna Cortes, Misha Bilenko and Aris Gionis; there were two others that I’m sure I also would have found very interesting but was unable to attend, because there was a rather important deadline coming up and so I had to zig-zag between Croke Park and DCU to make sure I got everything finished on time.

My own talk went reasonably well, I felt, with an audience of about twenty and some useful discussion afterwards on how I might go about modelling and quantifying the concept of utility reduction. On the last day of the conference, which was today, I went to another workshop, the 3rd Workshop on Data Science for Social Good (SoGood 2018), with presentations on how machine learning and data mining techniques can be used to address societal issues such as homelessness and corruption. I especially enjoyed the last one, if enjoy is the right word – it dealt with efforts to combat human trafficking by means of data science, certainly a worthy cause if ever there was one, but also rife with difficulties from the scarcity of good input data to the nigh-impossibility of devising an ethically justifiable experiment when there are literally lives at stake. Plenty of food for thought there, and a fine way to finish off this week of conference activities; on Monday it’s back to business as usual.

Rocky road to Dublin

When I first arrived at DCU to begin my MSCA fellowship, my supervisor paraded me around the place introducing me to various people, most of whose names I promptly forgot. (Sorry!) What did stick in my mind, however, were the numerous congratulations I received on winning such a competitive grant, which the Marie Curie fellowship certainly is, and on getting such a sweet deal, which it most definitely is. But the road here wasn’t all good times and glowing reviews – far from it, in fact. That’s why I thought I’d share the story of how I came to be in Dublin and how things might have turned out quite differently, had I been a bit less perseverant than I was.

I got my doctorate in 2014 at the relatively mature age of 35, having gone through a process that was a good deal more convoluted than it was, in theory, supposed to be. Following my academic baptism of fire, working as a Master’s student in a project dealing with computational quality assurance of spot welding joints, I embarked on a rather erratic journey that saw me dabble with research topics as varied as exergame design and ethics of scam baiting. I eventually pieced together my dissertation around the theme of knowledge discovery in databases, focusing on what the overall process looks like from different perspectives and how it can be supported effectively.

Not only was the process of writing my thesis unnecessarily complicated, but so was the process of getting the manuscript reviewed and accepted for publication. When the pre-examination phase was finally over and I was galloping, or at least cantering briskly, into the home stretch, there turned out to be one more obstacle to clear: finding an external examiner for the defence. I pitched a number of names to my then-supervisor, but one by one they all respectfully declined the invitation, not having the time to spare or not feeling that the topic of the thesis was close enough to their area of expertise. Fortunately, one of them offered the suggestion that Alan Smeaton of DCU might be our man.

This turned out to be a considerable stroke of luck: Alan was indeed kind enough to accept the job, and his style of handling it was very much what I’d hoped for, making this test of my ability to defend my magnum opus feel much less like a test and more like a friendly conversation on a topic of mutual interest. Some among the audience even described the proceedings as entertaining, which is hardly the word that first comes to my mind when I think about thesis defences! In all honesty, it’s not the word I’d choose to describe mine either – I was far too nervous to be entertained – but it did feel quite rewarding after all the hard work (not to mention a not negligible amount of self-doubt) to talk for several hours as an equal to a senior academic who had taken such an interest in my research.

Having thus finished my doctorate, it was time to make some decisions. I felt that I wasn’t quite done with academia yet, but I also felt that I needed some kind of change, and besides, there’s a certain amount of pressure on fresh post-docs to go explore the world beyond their alma mater for a while. Since I happen to have a great deal of appreciation for many things Irish (including, but not limited to, grain-based fermented beverages and traditional music), this seemed like a potential opportunity to combine business with pleasure, so the next time I visited Dublin, I met up with Alan to have a cup of tea and a chat on the possibility of moving there to work with him, provided that some funding could be secured. Later that year, we submitted our first proposal for an MSCA Individual Fellowship to the European Commission.

MSCA wasn’t the only funding source we considered; I also applied to the Academy of Finland for a grant that would have involved me mostly staying in Finland at the University of Oulu but spending a mobility period of six months or so at DCU. However, the feedback I got from the Academy did not paint a rosy picture of my chances of winning the grant even after several iterations, and in the meantime, I was feeling rather aimless in my research and finding the idea of switching to industry more and more attractive. Unsurprisingly, my publication record for these past few years is not exactly impressive, but thankfully, the MSCA reviewers seemed to be more interested in what I wanted to accomplish than in what I’d (not) accomplished before. With the assistance of DCU’s excellent research support staff, Alan and I were able to put together a good plan, and then, after a very encouraging round of reviews, to make it even better for our second attempt. This was in 2016; in early 2017 we got the notification that the fellowship had been awarded. After that, I went on working in Oulu until the end of the year to complete my contract, took January 2018 off to wind down and make arrangements, moved to Dublin at the end of January and started the new project on the 1st of February.

If there’s a lesson to take home from all of this, I guess it would be that even if you haven’t been the most shining star of your peer group during your PhD studies, that doesn’t mean you’re out of options if you want to keep pursuing a career in research. Sure, some funders may dismiss you on the basis of your CV alone if they don’t think it shows enough potential, but still, a well prepared proposal can go a long way if you know your strengths and build your research plan around them. MSCA Individual Fellowships are perhaps a more forgiving funding source than many others, since they’re explicitly meant to help you advance your career and thus come with the built-in assumption that there’s something important you don’t yet have but can gain by carrying out the right sort of project with the right sort of host. So, if you’ve found yourself a host with a set of known strengths that complement the ones you’ve demonstrated in your previous work, you already have a pretty solid foundation to build your proposal on.

Getting started

Welcome to You Know Nothing, Socrates! The theme of this blog is knowledge, or more specifically – because that sure could use some narrowing down – the intersection of knowledge (in the philosophical sense) and computing. Knowledge, of course, is a notoriously elusive concept once you start trying to pin it down, which is why I’ve decided to name the blog after the famous Socratic paradox, apocryphal though it may be. And before you ask: yes, the title is also a Game of Thrones reference. Get over it.

To make matters worse, we haven’t been content to just assert that we as human beings have the ability to know various things and to derive new knowledge from evidence. Instead, ever since the invention of the modern digital computer, we’ve been very keen on the idea of replicating, or at least imitating, that ability in machines. This pursuit has given rise to fields of computer science research such as knowledge representation and knowledge discovery; this is the area where I’ve been working throughout my career as a researcher, and also the main subject area that I’ll be writing about.

A bit of context: I’m currently working as a Marie Curie Individual Fellow at the Insight Centre for Data Analytics in Dublin, Ireland. The project I’m working on, titled KDD-CHASER, deals with remote collaboration for the extraction of useful knowledge from personal data, such as one might collect using a wearable wellness device designed to generate meaningful metrics on the wearer’s physical activity and sleep. These products are quite popular and, presumably, useful, but for most users their utility is limited to whatever analyses the product has been programmed to give them. The research I’m doing aims to create an online platform through which users of personal data capturing devices can discover additional knowledge in their data with the help of expert collaborators.

As long as the KDD-CHASER project is running, which is until the end of January 2020, I will be using this blog as a communication channel (among others) to share information about its progress and results with the public. However, I’m also planning to post more general musings on topics that are related to, but not immediately connected with, the work I’m doing in the project. These, I hope, will be enough to keep the blog alive after the project is done and I move on to other things. Not that I’m expecting those other things to be radically different from the things I’m involved in at the moment, but hey, you never know.

There certainly isn’t a shortage of subject matter to draw on: besides the under-the-hood mechanics of computers capable of possessing and producing knowledge, there’s the philosophical dimension of them that I’m also deeply interested in – another reason for my choice of blog title. From here it’s not much of a conceptual leap to the even more bewildering philosophical questions surrounding the notion of artificial intelligence, so I might take the occasional stab at those as well. I fully expect to come to the conclusion that I really know absolutely nothing, but whether I’ll be any the wiser for it remains to be seen.