Deck the halls

It’s been a weird couple of months since I came back from my summer vacation. I haven’t kept track of how my working time has been split among the tools I’ve used, but if I had, I’m pretty sure that number one on the list would be PowerPoint. So many lectures and presentations! I guess it’s good that I get to work on my communication skills, and I even quite enjoy it when I get to give a well-prepared presentation on a topic I’m genuinely interested in and have something original to say about, but still, enough is enough. I’m hoping this is just a temporary state of affairs, but if not, I may need to work on my saying-no-to-speaker-invitations skills.

Indeed, 2024 is already a record year for me in terms of the number of various speaking engagements I’ve had. There are two major reasons for this, the first one being the Reboot Skills project, in which I designed and implemented a course titled Data Governance and Privacy. In addition to the course sessions – three main ones in Finnish, plus an additional one in English in collaboration with the University of Limerick – I’ve attended at least three industry events where I spoke on the subject and pitched the course before it began. Despite these efforts, the course attracted a disappointingly small number of participants, but even so, I’m quite happy to lay it to rest for now and focus on other things.

The other reason is my work on AI ethics, which has gotten me invited to a bunch of seminars recently. This semester I’ve already participated in two: in August, there was a university pedagogy seminar where I again presented the results of my pilot study on integrating AI tools into AI ethics teaching, and a week ago I spoke on responsible AI in research at a seminar organised by the university’s Ethics Working Group. Coming up next is the Tethics conference, where I will both present a paper and co-host a workshop on technology ethics education, and at the end of November comes a seminar at the Finnish National Defence University in Helsinki, where I’m slated to give my perspective as an AI ethicist on the topic of AI on the battlefield. Nothing yet scheduled for December, but there’s still time…

Tethics, for me, is going to be a somewhat more hurried affair this year. I will be there for the whole duration of the conference, but instead of travelling the day before as I normally would, I’m going to take a night train that arrives in Tampere on the morning of the first day. The reason for this is that I have commitments in Oulu that prevent me from leaving much earlier than midnight on the night between the 5th and 6th of November. More specifically, the evening of the 5th brings the first of three dress rehearsals for a stage adaptation of Dickens’s A Christmas Carol at Oulu City Theatre, and since I won’t be in town for the other two, I can’t afford to skip it.

That’s right, I’m going to be back on stage, less than a year after the end of The Magic Flute! The director is the same, and when I heard she was looking for singers for this production, it didn’t take me too long to decide that I want in. The only reason why I needed any time at all to think about it was that the rehearsals clash with those of Cassiopeia, so I’ve been mostly absent from the choir since the beginning of September. However, it’s not that often that an opportunity like this turns up, and the only big choir thing remaining this year is the traditional Christmas concert, so I figured now’s not the worst time to take a little break.

Compared to the opera, working on the play is notably different in a few respects. For one thing, instead of a whole chorus of forty singers there’s only a quartet, and we also have significantly more time on stage, so I have a bigger role now, even though I’m not playing an actual named character. I even have a couple of spoken lines! I’m also officially employed by the theatre this time – the pay is hardly worth mentioning, but just the fact that I’m getting money for something I’m basically doing as a hobby is pretty cool.

Artistically speaking, the biggest difference is that we’re not on stage as singers, but rather as actors playing singers. This may seem like semantic quibbling but is actually a significant distinction, as everything we do on stage must be in service of the story. To some extent this was the case also with The Magic Flute, but surely it would have been too sacrilegious to touch Mozart’s music, no matter what the director’s vision called for. Here, on the other hand, it’s often the case that we don’t get to sing a song all the way through because the rhythm of the scene doesn’t allow it, and on a couple of occasions we get interrupted mid-verse by stage events. Apart from that, everything feels quite natural and I’m really happy and excited to be doing theatre again.

Another thing I’m very happy about is that with the Data Governance and Privacy course finished, I have some time to work on things that aren’t my next PowerPoint slideshow for a change. Like writing papers! There’s one I’ve been itching to get started on for a good while now, and it looks like now is finally the time. I’m also supposed to be working on a couple of projects besides Reboot Skills, and “no updates from me” is a phrase I’ve had to use a bit too frequently in meetings of late. Who knows – maybe there’ll be more papers to write once I’ve reminded myself what it is that I’m meant to be doing in those projects…

Summing up the AI summit

The end of the year is approaching fast, with Christmas now barely two weeks away, but I managed to fit in one more virtual event to top off this year of virtual events: the Tortoise Global AI Summit. To be quite honest, I wasn’t actually planning to attend – I didn’t even know it was happening – but a colleague messaged me the previous day, suggesting that it might be relevant to my interests and that the top brass would appreciate some kind of executive summary for the benefit of the Faculty. Despite the short notice I had most of the day free from other engagements, and since the agenda did indeed look interesting, I decided to register and check it out – I hope this blog post is close enough to what the Dean had in mind!

I liked the format of the event, a series of panel discussions rather than a series of presentations. Even the opening keynote with Oxford’s Sir Nigel Shadbolt was organised as a one-on-one chat between Sir Nigel and Tortoise’s James Harding, which felt more natural in an online environment than the traditional “one person speaks, everyone else listens, Q&A afterward” style. Something that worked particularly well was the parallel discussion on the chat, to which anyone attending the event could contribute and from which the moderators would from time to time pick questions or comments to be discussed with the main speakers. Overall, I was left with the feeling that this is the way forward with virtual events: design the format around the strengths of online instead of trying to replicate the format of an offline event using tools that are not (yet) all that great for such a purpose. 

The keynote set the tone for the rest of the event, bringing up a number of themes that would be discussed further in the upcoming sessions: the hype around AI versus the reality, transparency of AI algorithms and AI-based decision making, AI education – fostering AI talent in potential future professionals and data/algorithm literacy in the general populace – and the need for data architectures designed to respect the ethical rights of data subjects. Unhealthy power concentrations and how to avoid them was a topic that resonated with the audience, and it shouldn’t be too hard to think of a few examples of such concentrations. The carbon footprint of running AI software was brought up on the chat. Perhaps my favourite bit of the session was Sir Nigel’s point that there is a need for institutional and regulatory innovations, which he illustrated by mentioning the limited company as a historical example of an institutional innovation. Such innovations are perhaps more easily overlooked than scientific and technological ones, but one can hardly deny that they, too, have changed the world and will continue to do so.

The world according to Tortoise

The second session was about the new edition of the Tortoise Global AI Index, which ranks 62 countries of the world on their strength in AI capacity, defined as comprising the three pillars of implementation, innovation and investment. These are further divided into the seven sub-pillars of talent, infrastructure, operating environment, research, development, government strategy and commercial, and the overall score of each country is based on a total of 143 individual indicators. The scores are normalised such that the top country gets an overall score of 100, and it’s no big surprise that said country is the United States, as it was last year when the index was launched. China and the United Kingdom similarly retain their places as no. 2 and no. 3, respectively. China has closed some of the gap with the US but is still quite far behind with a score of 62, while the UK, sitting at around 40, has lost some of its edge over the challengers. Canada, Israel, Germany, the Netherlands, South Korea, France and Singapore complete the top 10. 
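
To make the normalisation step concrete, here’s a minimal sketch in Python – the raw scores below are invented for illustration (chosen to land near the published figures), and the real index of course aggregates weighted scores from the 143 indicators rather than starting from a single number per country:

```python
# Toy illustration of the index normalisation: scale composite scores so
# that the top-ranked country gets exactly 100. The raw scores are made up
# for the example, not taken from the actual Tortoise methodology.
raw_scores = {
    "United States": 0.91,
    "China": 0.56,
    "United Kingdom": 0.37,
}

top = max(raw_scores.values())
index = {country: round(100 * score / top, 1)
         for country, score in raw_scores.items()}

print(index)  # {'United States': 100.0, 'China': 61.5, 'United Kingdom': 40.7}
```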

Finland is just out of the top 10 but rising, up three places from 14th to 11th. According to the index, Finland’s particular forte is government strategy, comprising indicators such as the existence of a national AI strategy signed by a senior member of government and the amount of dedicated spending aimed at building AI capacity. In this particular category Finland is ranked 5th in the world. Research (9th) and operating environment (11th) can also be counted among Finland’s strengths, and all of its other subrankings (talent – 16th, commercial – 19th, infrastructure – 21st, development – 22nd) are solidly above the median as well. Interestingly, the US, while being ranked 1st in four categories and in the top 10 for all but one, is only 44th on operating environment. The most heavily weighted indicator here is the level of data protection legislation, giving countries covered by the GDPR a bit of an edge; 7 of the top 10 in this category are indeed EU countries, but there is also, for instance, China in 6th place, so commitment to privacy is clearly not the whole story. 

There was some good discussion on the methodology of the AI index, such as the selection of indicators. For example, one could question the rather heavy bias toward LinkedIn as a source of indicators for AI talent. Another interesting point raised was that while we tend to consider academics mainly in terms of their affiliation, it might also be instructive to look at their nationality. Indeed, the hows and whys of the compilation of the index would easily make for a dedicated blog post, or even a series of posts, but I’ll leave it for others to produce a proper critique. For those who are interested, a methodology report is available online. 

From the Global AI Index the conversation transitioned smoothly into the next session on the geopolitics of AI, where one of the themes discussed was whether countries should be viewed as competing against one another in AI, or whether AI should rather be seen as an area of international collaboration for the benefit of citizens everywhere. Is there an AI race, like there once was a space race? Is mastery of AI a strategic consideration? Benedict Evans advocated the position that to talk about AI strategy is to adopt the wrong level of abstraction, and that AI (or rather machine learning) is just a particular way of creating software that in about ten years’ time will be like relational databases are today: so ubiquitous and mundane that we hardly pay any attention to it. This was in stark contrast to the view put forward at the beginning of the session that AI is a general-purpose technology akin to electricity, with comparable potential to revolutionise society. The session was largely dominated by this dialectic, but there was still time for other themes as well, such as the nature of AI clusters in a world where geographically limited technology clusters are becoming an outdated concept, and the role of so-called digital plumbing in providing the essential foundation for the success of today’s corporate AI power players.

Winners and losers

The next session, titled “AI’s ugly underbelly”, started by taking a look at an oft-forgotten part of the AI workforce: the people who label data so that it can be used to train machine learning models. It’s been estimated that data labelling accounts for 25% of the total time in an ML project, but the labellers are, from the perspective of the company running the project, an anonymous mass employed through crowdsourcing platforms such as MTurk. In academic research the labellers are often found closer to home – the job is likely to be done by your students and/or yourself, and when crowdsourcing is used, people may well be willing to volunteer for the sake of contributing to science, as in the case of the Zooniverse projects. In business it’s a different story: there is some money to be made by labelling data for companies, but not a lot. It’s an unskilled job that obeys the logic of the gig economy, where the individual worker must buy their own equipment and has very little in the way of job security or career prospects.

The subtitle of this session was “winners and losers of the workforce”, the winners of course being the highly skilled professionals who are in increasingly high demand and therefore increasingly highly paid. There was a good deal of discussion on the gender imbalance among such people, reflecting a similar imbalance in the distribution of the sort of hard (STEM) skills usually associated with tech jobs. In labelling the gap is apparently much narrower, in some countries even nonexistent. It was argued that relevant soft skills and potential AI talent are distributed considerably more evenly, and that companies trying to find people for AI-related roles may want to look beyond the traditional recruiting pathways for such roles. A minor point that I found thought-provoking was that recruiting is one of the application domains of AI, so the AI of today is involved in selecting the people who will build the AI of tomorrow – and we know, of course, that AI can be biased. One of the speakers brought up the analogy that training an AI is like training a dog in that the training may appear to be a success, but you cannot be sure of what it is that you’ve actually trained it to respond to. 

There was more talk about AI bias in the “AI you can trust” session, starting with what we mean by the term in the first place. We can all surely agree that AI should be fair, but can we agree on what kind of fairness we want – does it involve positive discrimination, for example? Bias in datasets is a relatively straightforward concept, but beyond that things get less tidy and more ambiguous. There is also the question of how we can trust that an AI is not biased, provided that we can agree on the definition; a suggested solution is to have algorithms audited by a third party, which could provide a way to strike a balance between the right of individuals to know what kind of decision-making processes they are being subjected to and the right of organisations to keep their algorithms confidential. An idea that I found particularly interesting, put forth by Carissa Véliz of the Institute for Ethics in AI, was that algorithms should be made to undergo a randomised controlled trial before they are allowed to make decisions that have a serious, potentially even ruinous, effect on people’s lives. 

Data protection was, of course, another big topic in this session. That personal data should be handled responsibly is again something we can all agree on, but there was a good deal of debate on the proper way to regulate companies to ensure that they are willing and able to shoulder that responsibility. Should they be told how to behave in a top-down manner, or is it better to adopt a bottom-up strategy and empower individuals to look after their own interests when it comes to privacy? Is self-regulation an option? The data subject rights guaranteed by the GDPR represent the bottom-up approach and are, in my opinion, a major step in the right direction, but it’s also a matter of having effective means to enforce those rights, and here, I feel, there is still a lot of work to be done. The GDPR, of course, only covers the countries of the EU and the EEA, and it was suggested that perhaps we need an international organisation for the harmonisation of data protection, a “UN of data” – a tall order for sure, but one worth considering.

Grand finale

The final session, titled “AI: the breakthroughs that will shape your life”, included several callbacks to themes discussed in previous sessions, such as the growth of the carbon footprint of AI as the computational cost of new breakthroughs continues to increase – doubling almost every 3 months according to an OpenAI statistic. The summit took place just days after the announcement of a great advance achieved by DeepMind’s AlphaFold AI in solving the protein folding problem in computational biochemistry, already mentioned at the beginning of the first session and discussed further here. While it was pointed out that the DeepMind solution is not necessarily the be-all and end-all it has been hailed as, it certainly serves to demonstrate that the technology is good for tackling serious scientific problems and not just for mastering board games. The subject of crowdsourcing came up again in this context, as the approach has been applied to the folding problem with some success in the form of Folding@home, where the home computers of volunteers are used to run distributed computations, as well as Foldit, a puzzle video game that essentially harnesses the volunteers’ brains to do the computations.

There was some debate on the place of humans in a society increasingly permeated by AI systems, particularly on where we want to draw the line on AI autonomy and whether new jobs created by AI will be enough to compensate for old ones replaced by AI. Somewhat ironically, data labeller is a job created by AI that may already be on its way to being made obsolete by advances in AI techniques that do not require large quantities of labelled data for training. One of the speakers, Connecterra founder Yasir Khokhar, talked about the role of AI in solving the problem of feeding the world, reminding me of Risto Miikkulainen’s keynote talk at CEC 2019, in which he presented agriculture as one of the application domains of creative AI through evolutionary computation. OpenAI’s GPT-3 was then brought up as another example of a recent breakthrough, leading to a discussion on how we tend to anthropomorphise our Siris and Alexas and to ascribe human thought processes to entities that merely exhibit some semblance of them. There was a callback to AI ethics here when someone asked whether we have the right to know when we are interacting with an AI – if we’re concerned about AI transparency, then arguably being aware that there is an AI is the most basic level of it. Of things that are still in the future, the impact of quantum computing on AI was discussed, as were the age-old themes of artificial general intelligence and rogue AI as existential risk, but in the time available it wasn’t feasible to come to any real conclusions. 

Inevitably, it got harder to stay alert and focused as the afternoon wore on, and I also missed the beginning of one session because I had to attend another (albeit very brief) meeting, but even so, I managed to gather a good amount of interesting ideas and information over the course of the day. I’m particularly happy that I got a lot of material on the social implications of AI that we should be able to use when developing our upcoming AI ethics course, since so far I haven’t been too clear about specific topics related to this aspect of AI that we could discuss in the lectures. This wasn’t a week too soon, I might add – we’re due to start teaching that course in March, so it’s time to get cracking on the preparations!

Collaboration, schmollaboration

Whenever someone asks me what my research project is about, I usually open by saying we’re calling it collaborative knowledge discovery from data. That’s a nice, convenient way of putting it in a nutshell, but it immediately calls for some elaboration, especially on the meaning of the term “collaborative”. Technically, any activity that involves two or more people working together toward a common goal is collaborative, but this definition doesn’t get us very far, because in knowledge discovery you typically have at least someone who knows about the technology and someone who knows about the application domain. It’s not unheard of for one person to know about both, but still, I think it’s safe to say that collaboration is the rule rather than the exception here.

To narrow it down a bit, the kind of collaboration we’re talking about is remote and synchronous. In other words, the participants are not located in the same place, but they can all simultaneously edit whatever it is they’re collaborating on and see the effects of each other’s edits in real time. This implies that there must be some kind of online environment where the collaboration takes place; think something like Google Docs or MS Office Online, only for KDD artifacts such as datasets, algorithms and processing pipelines.

Even this is not a particularly novel idea in itself, as there are collaboration platforms already available where you can develop just these sorts of things. Therefore in KDD-CHASER we’re upping the ante even further by focusing specifically on collaborative knowledge discovery from personal data, driven by the data owner who cannot be assumed to have any particular technology or domain expertise. It’s a bit of a niche, which of course makes our eventual results somewhat less generalisable, but it also makes it considerably easier to spot opportunities for novel research contributions.

To me, the most interesting problems here are not necessarily associated with knowledge discovery as such but with the things that need to happen before knowledge discovery can take place. After all, from the data owner’s perspective the point of collaborating with experts is basically to have the actual discovery done by people who are better equipped for it in terms of skills and tools. This doesn’t mean, however, that the data owner’s role in the collaboration is limited to being a passive data source; on the contrary, it is the data owner’s needs that drive the entire process of collaborative KDD in the special case we’re considering.

The first problem that a data owner may encounter on the way to a successful collaboration is that they don’t even know anyone they could collaborate with, so the first thing the collaboration platform should do is provide a matchmaking service that brings together people who have data with people who have the right sort of expertise to help turn it into something more valuable. After the matchmaking follows the really interesting part: negotiation. What kind of knowledge is the data owner interested in? What is actually achievable, given the available data and the extent to which the data owner is willing to share it with the expert? What is the expert expecting to get in compensation for their efforts? The collaborators need to find the answers to such questions among themselves, and the collaboration platform should support them in this.

The bare minimum is to provide the collaborators with some kind of communication channel, but this is something that would be required anyway, and it’s hardly a research problem from a computing standpoint. However, there’s a lot more to negotiation than just talking, and I’m interested to see what I might do to help things along in this area. Privacy, for example, is traditionally close to my heart and something I want to address here as well, because one of the things to be determined through negotiation is how much of their data the data owner is prepared to trust their collaborators with, considering that the latter may be the KDD equivalent of someone they just matched with on Tinder.

It’s been pretty clear from the start that whatever we manage to accomplish in my current project, it’s not going to be a comprehensive solution to all the problems of collaborative KDD, even within the niche we’ve carved for ourselves. What we can realistically shoot for, though, is a model that shows us what the collaboration process looks like and gives us an understanding of where the major problems are. The software I’m building will basically be a collection of candidate solutions to a select few of these problems, and it will hopefully be something I can continue to build on when my MSCA fellowship is over.

Busy times

With the end-of-year holidays approaching, things tend to get busy in a lot of places, not just in Santa’s workshop. My life in Ireland is no exception: there are five major work-related (or at least university-related) things that I’ve been trying my best to juggle through November, with varying success. Many of these will culminate over the next two weeks or so; after that, I’m hoping it will be comparatively smooth sailing till I leave for my well-deserved Christmas break in Finland. The blog I’m not even counting among the five, and I’ve been pretty much neglecting it, so this post is rather overdue – and also a welcome break from all of the more pressing stuff that I should really be working on right now.

One area where I’ve had my hands full is data protection, where it seems that whenever a document is finished, there’s always another one to be prepared and submitted for evaluation. Getting a green light from the Research Ethics Committee was a big step forward, but there’s now one more hurdle left to overcome in the form of a Data Protection Impact Assessment. I’m very much learning (and making up) all of this as I go along, and the learning curve has proved a rather more slippery climb than I expected, but I’m getting there. In fact, I’m apparently one of the first to go through this process around here, so I guess I’m not the only one trying to learn how it works. I hope this means that things will be easier for those who come after me.

Meanwhile, I’ve been preparing to give my very first lecture here at DCU – thankfully, just one guest lecture and not a whole course, but even that is quite enough to rack my nerves. It is a little strange that this should be the case, even after all the public speaking I’ve had to do during my fifteen-plus years in research, but the fact of the matter is that it does still feel like a bit of an ordeal every time. Of course it doesn’t help that I’m in a new environment now, and also I’ll be speaking to undergraduate students, which is rather different from giving a presentation at a conference to other researchers. Still, I’m not entirely unfamiliar with this type of audience, and I can recycle some of the lecture materials I created and used in Oulu, so I think I’m going to be all right.

Speaking of conferences, I’m serving on the programme committee of the International Conference on Health Informatics for the second year running, and the manuscript reviewing period is currently ongoing, so that’s another thing that’s claimed a sizable chunk of my time recently. Somewhere among all of this I’m somehow managing to fit in a bit of actual research as well, although it’s nowhere near as much as I’d like, but I guess we’ve all been there. The software platform is taking shape towards a minimum viable product of sorts, and I have a couple of ideas for papers I want to write in the near future, so there’s a clear sense of moving forward despite all the other stuff going on.

So what’s the fifth thing, you ask? Well, I’ve rekindled my relationship with choral singing by joining the DCU Campus Choir, having not sung in a proper choir since school. Despite the 20-year gap (plus a bit), I haven’t had much trouble getting into it again: I can still read music, I can still hit the bass notes, and I don’t have all that much to occupy myself in the evenings and weekends so I have plenty of time to learn my parts (although I’m not sure how happy my neighbours are about it). The material we’re doing is nice and varied, and the level of ambition is certainly sufficient, as it seems like we’re constantly running out of rehearsal time before one performance or other. Our next concert will be Carols by Candlelight at DCU’s All Hallows campus on the evening of Monday the 10th of December, so anyone reading this who’s in town that day is very warmly welcome to listen!

A Solid foundation for social apps?

Tim Berners-Lee recently posted an open letter on the web, announcing to the wider online community the launch of Solid, a new technology platform that he and his team at MIT have been working on for the past few years. Like a lot of people these days, he’s not too happy about the way our personal data is being controlled and exploited by providers of online services, but when the father of the web is telling you how it’s not gone the way he intended, you may want to prick up your ears even if you personally have no problem with the way things are. Not only that, but when he says he’s come up with something that we can use to set things right, it’s probably worth checking out.

We’ve all seen the headlines that result when a company with a business model based on aggregating and monetising personal data gets negligent or unscrupulous with the data in its possession, but these incidents are really just symptoms of a more fundamental issue concerning the architecture of basically every popular online social application out there. Even if we imagine a perfect world of ideal application providers that are completely open and honest about how they use your data and never suffer any security breaches, the fact remains that they, not you, control the data you’ve given them. You still own it, yes, but they control it.

Why is this an important distinction? The answer has to do with the coupling of your data with the specific services you’re using: you can’t have one without the other. As a result, your data is broken up into pieces that are kept in separate bins, one for each service, even when it would be really helpful to have it all in the same place. If you want to use several services that all use the same data, you have to upload it to each one separately, and that’s assuming that you have or can get the data in a reusable format, which isn’t always the case. It would make a lot more sense to have just a single copy of the data and permit the services to access that – within privacy parameters that you have complete control of – and it would be even better if you could move your data to a different location without breaking all those services that depend on it.

Sound good? Well, the people behind Solid apparently want you to be able to do just that. Their proposed solution is based on decoupling data from applications and storing it in units called PODs (short for personal online data store). Applications built on the Solid platform can access the data in your POD if you give them permission to do so, but they don’t control the data, so they can’t impose any artificial restrictions on how you use, combine and reuse data from different sources. The end-users of Solid are thus empowered to make the best possible use of their data while retaining full control of what data they disclose and to whom, which is very much what I’m aiming for in my own research; I can easily see collaborative knowledge discovery as an app implemented on Solid or some similar platform.

So that’s the theory, anyway. What about reality? I can’t claim to have examined the platform in great depth, but at least on the surface, there are a number of things that I like about it. It’s built on established W3C specifications in what looks like a rather elegant way where linked data technologies are used to identify data resources and to represent semantic links between them – for example, between a photo published by one user and a comment on the photo posted by another. Besides your data, your POD also holds your identity that you use to access various resources, somewhat like you can now use your Google or Facebook credentials to log in to other vendors’ services, but without the dependence on a specific service to authenticate your identity. Of course, you still need to get your Solid POD from somewhere, but you’re free to choose whichever provider suits you best, or even to set up your own Solid server if you have the motivation and the means.
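
To give a flavour of what those semantic links might look like, here’s a small sketch using Python’s rdflib – the vocabulary, resource URLs and data below are all made up for illustration, not taken from the actual Solid specifications:

```python
from rdflib import Graph, Namespace, URIRef

# Made-up linked data: a photo in Alice's POD and a comment on it in Bob's
# POD, linked across the two stores. The URLs and the ex: vocabulary are
# illustrative placeholders, not actual Solid conventions.
turtle = """
@prefix ex: <http://example.org/vocab#> .

<https://alice.podprovider.example/photos/42>
    a ex:Photo ;
    ex:creator <https://alice.podprovider.example/profile#me> .

<https://bob.podprovider.example/comments/7>
    a ex:Comment ;
    ex:about <https://alice.podprovider.example/photos/42> ;
    ex:creator <https://bob.podprovider.example/profile#me> ;
    ex:text "Lovely shot!" .
"""

EX = Namespace("http://example.org/vocab#")
photo = URIRef("https://alice.podprovider.example/photos/42")

g = Graph()
g.parse(data=turtle, format="turtle")

# Follow the links: find every comment about Alice's photo, regardless of
# which POD it lives in
for comment in g.subjects(EX.about, photo):
    print(comment, g.value(comment, EX.text))
```

The nice thing here is that the photo and the comment live in different stores under different owners, yet the link between them can be traversed as if they were one dataset – which is essentially the decoupling Solid is after.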

Whether Solid will catch on as a platform for a new class of social web apps is not just a matter of whether it is technically up to the challenge, though. The point of social media is very much to have everyone in your social network using the same applications, so the early adopters won’t have much of an impact if their friends decide that it’s just so much more convenient to keep using the apps where they already have all their connections and content rather than to switch platforms and build everything up all over again – which, of course, is precisely the sort of thinking the providers of those apps are counting on and actively reinforcing. People like me may give Solid a go out of sheer curiosity, but I suspect that the majority can’t be bothered unless there are Solid apps available that let them do things they really want to do but haven’t been able to before. Taking control of your own data is a noble principle for sure, but is it enough to attract a critical mass of users?

Then there’s the question of how the Solid ecosystem will work from a business perspective. The supply of interesting applications is going to be quite limited unless there’s money to be made by developing them, and presumably the revenue-generation models of centralised social apps can’t simply be dropped into a decentralised environment such as Solid without any modifications. We pretty much take it for granted now that we can “pay” for certain kinds of services through the act of using them and generating data for the service provider to use as raw material for services that the provider’s actual customers will pay good money for, but would – and should – this work if the provider could no longer control the data? On the other hand, would we be willing to pay for these services in cash rather than data, now that we’ve grown so used to getting them for “free”? Then again, there was a time when it was not at all clear how some of today’s multi-billion-dollar companies were ever going to turn a profit, so maybe we just need the right sort of minds to take an interest for these things to get figured out.

It’s also worth noting that Solid is by no means the only project aiming to make the web less centralised and more collaborative. There is a substantial community of researchers and developers working on solutions to various problems in this area, as evidenced by the fact that Solid is but one of dozens of projects showcased at the recent DWeb Summit in San Francisco, so it may well turn out that even if Solid itself won’t take off, some other similar thing will. I won’t be betting any money on any of the contenders just yet, but I probably will get myself a Solid POD to play with so I can get a better idea of what you can do with it.

First blood

Time to look at the first results from my project! Well, not quite – the first results are in a literature survey I did immediately after starting the project and made into a journal manuscript. I’m currently waiting for the first round of reviews to come in, but in the meantime I’ve been busy developing my ideas about collaborative knowledge discovery into something a bit more concrete. In particular, I’ve been thinking about one of the potential obstacles to effective collaboration from the data owner’s perspective: privacy.

In the aftermath of the much publicised Facebook–Cambridge Analytica scandal, one would at least hope that people are becoming more wary about sharing their personal data online. On the other hand, with the General Data Protection Regulation in full effect since 25 May, a huge number of people are now covered by a piece of legislation that grants them an extensive set of personal data control rights and has the power to hurt even really big players (like Facebook) if they don’t respect those rights. Of course, it’s still up to the people to actually exercise their rights, which may or may not happen, but after all the GDPR news, emails and “we use cookies” notices on websites, they should be at least vaguely aware that they have them.

The increased awareness of threats to privacy online and the assertion of individuals, rather than corporations, as the owners of their personal data are welcome developments, and I like to think that what I’m trying to accomplish is well aligned with these themes. After all, the collaborative knowledge discovery platform I’m building is intended to empower individual data owners: to help them extract knowledge from their own data for their own benefit. This does not make the privacy issue a trivial one, however – in fact, I wouldn’t be surprised if it turned out that people are more uneasy about sharing a small portion of their data with an individual analyst focusing on their case specifically than about using an online service that grabs and mines all the data it can but does so in a completely impersonal manner. The platform will need to address this issue somehow lest it end up defeating its own purpose.

The angle from which I decided to approach the problem involves using a domain ontology and a semantic reasoner, which are technologies that I had been interested in for quite some time but hadn’t really done anything with. As I was doing the literature survey, I became increasingly convinced that an underlying ontology would be one of the key building blocks of the new platform, but it was also clear to me that I would need to start by modelling some individual aspect of collaboration as a proof of concept, so that I would fail fast if it came to that. If I started working top-down to produce a comprehensive representation of the entire domain, in the worst case I might take ages to discover nothing but that it wasn’t a very viable approach after all.

All this came together somewhat serendipitously when I found out that the 2nd International Workshop on Personal Analytics and Privacy (PAP 2018), held in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2018) in Dublin, had an open call for papers. The submission deadline was coming up in a month – enough time to put together some tentative results, though nothing hugely impressive – and coincided rather nicely with the date when I was planning to fly to Finland for my summer holidays. In about two weeks I had the first version of the manuscript ready, with another two left over for revisions.

The ontology I designed is based on the idea of a data owner and a data analyst (or possibly any number of either) using the collaborative knowledge discovery platform to negotiate the terms of their collaboration. Each uses the platform to specify requirements, but from opposing perspectives: the data analyst specifies analysis tasks, which require certain data items as input, while the data owner specifies privacy constraints, which prevent certain data items from being released to the data analyst. The data owners, data analysts, data items, analysis tasks and privacy constraints are all registered as individuals in the ontology and linked with one another such that a reasoner is able to use this information to detect conflicts, that is, situations where a data item is required for a data analysis task but not released by the data owner.
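
Stripped of the OWL machinery, the core of the conflict detection amounts to a simple set intersection. Here’s a minimal plain-Python sketch of the logic – the actual design registers these as ontology individuals and leaves the detection to a semantic reasoner, and the data items below are invented for the example:

```python
# Plain-Python sketch of the conflict detection the ontology encodes; in the
# real design these are OWL individuals and a semantic reasoner finds the
# conflicts. The data items are invented examples.

# Analysis tasks specified by the data analyst, each requiring certain data items
analysis_tasks = {
    "sleep_quality_analysis": {"heart_rate", "sleep_log", "location"},
    "activity_summary": {"step_count", "heart_rate"},
}

# Data items the data owner's privacy constraints prevent from being released
withheld = {"location"}

def find_conflicts(tasks, withheld_items):
    """A conflict: a data item required by a task but not released by the owner."""
    return {task: required & withheld_items
            for task, required in tasks.items()
            if required & withheld_items}

print(find_conflicts(analysis_tasks, withheld))
# {'sleep_quality_analysis': {'location'}}
```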

To resolve such conflicts, the data owner and the data analyst may, for example, agree that the analyst receives a version of the dataset from which the most sensitive information has been removed. Removing information reduces the utility of the data, but does not necessarily make it completely useless; finding a balance where the data owner’s privacy preferences are satisfied while the data analyst still gets enough material to work with is the essence of the negotiation process. The ontology is meant to support this process by not just pointing out conflicts, but by suggesting possible resolutions based on recorded knowledge about the utility effects of different methods of transforming data to make it less sensitive.
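
Continuing the sketch, the resolution-suggestion side might look something like this – the transformations and utility values are entirely invented placeholders for the kind of recorded knowledge the ontology is meant to hold:

```python
# Hypothetical recorded knowledge about how privacy-preserving transformations
# of a data item affect its utility (0.0 = useless, 1.0 = full utility).
# All names and numbers are invented for illustration.
transformations = {
    "location": [
        ("suppress entirely", 0.0),
        ("generalise to city level", 0.6),
        ("add spatial noise", 0.8),
    ],
}

def suggest_resolutions(conflicting_item, min_utility=0.5):
    """Suggest transformations that keep the item useful enough to analyse."""
    options = transformations.get(conflicting_item, [])
    return [(name, utility) for name, utility in options if utility >= min_utility]

print(suggest_resolutions("location"))
# [('generalise to city level', 0.6), ('add spatial noise', 0.8)]
```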

For the PAP workshop paper, I only had time to design the logic of conflict detection in any detail, and there also was no time to test the ontology in a real-world scenario or even a plausible approximation of one. It therefore hardly seems unfair that although the paper was accepted for a short oral presentation at the workshop, it was not accepted for inclusion in the post-proceedings. Obviously it would have been nicer to get a proper publication out of it, but I decided to go ahead and give the presentation anyway – ECML-PKDD is the sort of conference I might have gone to even if I didn’t have anything to present, and since the venue is a 25-minute walk away from my house, the only cost was the registration fee, which I could easily afford from the rather generous allowance for sundry expenses that came with the MSCA fellowship.

Croke Park may seem like an unlikely place to have a conference, but it is in fact a conference centre as well as a stadium, and seems to work perfectly well as a venue for an academic event – meeting spaces, catering and all. Besides Croke Park, we had Mansion House for the welcome reception and Taylor’s Three Rock for the conference banquet, so can’t complain about the locations. The regular programme was quite heavy on algorithms, which isn’t really my number one area of interest, but I did manage to catch some interesting application-oriented papers and software demos. What I enjoyed the most, however, were the keynote talks by Corinna Cortes, Misha Bilenko and Aris Gionis; there were two others that I’m sure I also would have found very interesting but was unable to attend, because there was a rather important deadline coming up and so I had to zig-zag between Croke Park and DCU to make sure I got everything finished on time.

My own talk went reasonably well, I felt, with an audience of about twenty and some useful discussion afterwards on how I might go about modelling and quantifying the concept of utility reduction. On the last day of the conference, which was today, I went to another workshop, the 3rd Workshop on Data Science for Social Good (SoGood 2018), with presentations on how machine learning and data mining techniques can be used to address societal issues such as homelessness and corruption. I especially enjoyed the last one, if enjoy is the right word – it dealt with efforts to combat human trafficking by means of data science, certainly a worthy cause if ever there was one, but also rife with difficulties, from the scarcity of good input data to the nigh-impossibility of devising an ethically justifiable experiment when there are literally lives at stake. Plenty of food for thought there, and a fine way to finish off this week of conference activities; on Monday it’s back to business as usual.