Time to look at the first results from my project! Well, not quite – the first results are in a literature survey I did immediately after starting the project and made into a journal manuscript. I’m currently waiting for the first round of reviews to come in, but in the meantime I’ve been busy developing my ideas about collaborative knowledge discovery into something a bit more concrete. In particular, I’ve been thinking about one of the potential obstacles to effective collaboration from the data owner’s perspective: privacy.
In the aftermath of the much publicised Facebook–Cambridge Analytica scandal, one would at least hope that people are becoming more wary about sharing their personal data online. On the other hand, with the General Data Protection Regulation in full effect since 25 May, a huge number of people are now covered by a piece of legislation that grants them an extensive set of personal data control rights and has the power to hurt even really big players (like Facebook) if they don’t respect those rights. Of course, it’s still up to the people to actually exercise their rights, which may or may not happen, but after all the GDPR news, emails and “we use cookies” notices on websites, they should be at least vaguely aware that they have them.
The increased awareness of threats to privacy online and the assertion of individuals, rather than corporations, as the owners of their personal data are welcome developments, and I like to think that what I’m trying to accomplish is well aligned with these themes. After all, the collaborative knowledge discovery platform I’m building is intended to empower individual data owners: to help them extract knowledge from their own data for their own benefit. This does not make the privacy issue a trivial one, however – in fact, I wouldn’t be surprised if it turned out that people are more uneasy about sharing a small portion of their data with an individual analyst focusing on their case specifically than about using an online service that grabs and mines all the data it can but does so in a completely impersonal manner. The platform will need to address this issue somehow lest it end up defeating its own purpose.
The angle from which I decided to approach the problem involves using a domain ontology and a semantic reasoner, which are technologies that I had been interested in for quite some time but hadn’t really done anything with. As I was doing the literature survey, I became increasingly convinced that an underlying ontology would be one of the key building blocks of the new platform, but it was also clear to me that I would need to start by modelling some individual aspect of collaboration as a proof of concept, so that I would fail fast if it came to that. If I started working top-down to produce a comprehensive representation of the entire domain, in the worst case I might take ages to discover nothing but that it wasn’t a very viable approach after all.
All this came together somewhat serendipitously when I found out that the 2nd International Workshop on Personal Analytics and Privacy (PAP 2018), held in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2018) in Dublin, had an open call for papers. The submission deadline was coming up in a month – enough time to put together some tentative results, though nothing hugely impressive – and coincided rather nicely with the date when I was planning to fly to Finland for my summer holidays. In about two weeks I had the first version of the manuscript ready, with another two left over for revisions.
The ontology I designed is based on the idea of a data owner and a data analyst (or possibly any number of either) using the collaborative knowledge discovery platform to negotiate the terms of their collaboration. Each uses the platform to specify requirements, but from opposing perspectives: the data analyst specifies analysis tasks, which require certain data items as input, while the data owner specifies privacy constraints, which prevent certain data items from being released to the data analyst. The data owners, data analysts, data items, analysis tasks and privacy constraints are all registered as individuals in the ontology and linked with one another such that a reasoner is able to use this information to detect conflicts, that is, situations where a data item is required for a data analysis task but not released by the data owner.
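To make the conflict-detection logic concrete, here is a minimal Python sketch of the idea – not the actual ontology or reasoner, just the same logic expressed in plain code, with hypothetical names for the data items, task and constraint:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for ontology individuals; in the real design these
# are registered in a domain ontology and queried by a semantic reasoner.

@dataclass(frozen=True)
class DataItem:
    name: str

@dataclass
class AnalysisTask:
    name: str
    requires: set  # data items the analyst needs as input

@dataclass
class PrivacyConstraint:
    name: str
    withholds: set  # data items the owner will not release

def detect_conflicts(tasks, constraints):
    """A conflict is a data item that some analysis task requires
    but some privacy constraint prevents from being released."""
    withheld = set()
    for c in constraints:
        withheld |= c.withholds
    conflicts = []
    for task in tasks:
        for item in task.requires & withheld:
            conflicts.append((task.name, item.name))
    return conflicts

# Example negotiation state (all names invented for illustration)
location = DataItem("location_history")
steps = DataItem("step_count")
tasks = [AnalysisTask("activity_profile", {location, steps})]
constraints = [PrivacyConstraint("no_location", {location})]
print(detect_conflicts(tasks, constraints))
# [('activity_profile', 'location_history')]
```

The ontology version of this does the same thing declaratively: the reasoner infers the conflict from the links between individuals rather than from an explicit loop.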
To resolve such conflicts, the data owner and the data analyst may, for example, agree that the analyst receives a version of the dataset from which the most sensitive information has been removed. Removing information reduces the utility of the data, but does not necessarily make it completely useless; finding a balance where the data owner’s privacy preferences are satisfied while the data analyst still gets enough material to work with is the essence of the negotiation process. The ontology is meant to support this process by not just pointing out conflicts, but by suggesting possible resolutions based on recorded knowledge about the utility effects of different methods of transforming data to make it less sensitive.
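Continuing the sketch above, the resolution-suggestion step might look something like this – again a toy illustration, with made-up transformation names and utility scores standing in for the recorded knowledge about utility effects:

```python
# Hypothetical knowledge base: for each sensitive data item, a list of
# transformations that make it less sensitive, each paired with the
# fraction of analytical utility it retains (1.0 = full utility).
TRANSFORMATIONS = {
    "location_history": [
        ("coarsen_to_city_level", 0.6),
        ("remove_entirely", 0.0),
    ],
    "heart_rate": [
        ("aggregate_to_daily_average", 0.7),
        ("remove_entirely", 0.0),
    ],
}

def suggest_resolutions(conflicting_item, min_utility=0.5):
    """Suggest transformations for a conflicting data item, keeping only
    those that retain enough utility and preferring the least destructive."""
    options = TRANSFORMATIONS.get(conflicting_item, [])
    viable = [(t, u) for t, u in options if u >= min_utility]
    return sorted(viable, key=lambda tu: tu[1], reverse=True)

print(suggest_resolutions("location_history"))
# [('coarsen_to_city_level', 0.6)]
```

How utility reduction should actually be modelled and quantified is, as noted below, very much an open question; the point here is only the shape of the negotiation support the ontology is meant to provide.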
For the PAP workshop paper, I only had time to design the logic of conflict detection in any detail, and there was also no time to test the ontology in a real-world scenario or even a plausible approximation of one. It therefore hardly seems unfair that although the paper was accepted for a short oral presentation at the workshop, it was not accepted for inclusion in the post-proceedings. Obviously it would have been nicer to get a proper publication out of it, but I decided to go ahead and give the presentation anyway – ECML-PKDD is the sort of conference I might have gone to even if I didn’t have anything to present, and since the venue is a 25-minute walk away from my house, the only cost was the registration fee, which I could easily afford from the rather generous allowance for sundry expenses that came with the MSCA fellowship.
Croke Park may seem like an unlikely place to have a conference, but it is in fact a conference centre as well as a stadium, and seems to work perfectly well as a venue for an academic event – meeting spaces, catering and all. Besides Croke Park, we had Mansion House for the welcome reception and Taylor’s Three Rock for the conference banquet, so I can’t complain about the locations. The regular programme was quite heavy on algorithms, which isn’t really my number one area of interest, but I did manage to catch some interesting application-oriented papers and software demos. What I enjoyed the most, however, were the keynote talks by Corinna Cortes, Misha Bilenko and Aris Gionis; there were two others that I’m sure I also would have found very interesting but was unable to attend, because there was a rather important deadline coming up and so I had to zig-zag between Croke Park and DCU to make sure I got everything finished on time.
My own talk went reasonably well, I felt, with an audience of about twenty and some useful discussion afterwards on how I might go about modelling and quantifying the concept of utility reduction. On the last day of the conference, which was today, I went to another workshop, the 3rd Workshop on Data Science for Social Good (SoGood 2018), with presentations on how machine learning and data mining techniques can be used to address societal issues such as homelessness and corruption. I especially enjoyed the last one, if enjoy is the right word – it dealt with efforts to combat human trafficking by means of data science, certainly a worthy cause if ever there was one, but also rife with difficulties, from the scarcity of good input data to the nigh-impossibility of devising an ethically justifiable experiment when there are literally lives at stake. Plenty of food for thought there, and a fine way to finish off this week of conference activities; on Monday it’s back to business as usual.