Never before has big data weighed so heavily on Australians’ minds, with the government releasing a new app that aims to minimise community transmission of COVID-19 through contact tracing. The app has reignited the privacy debate: is it wise to give up so much personal data in exchange for a service – or, in this case, virus protection? In some ways the debate is lost before it even begins, with most Australians keen to give away their data willy-nilly to the likes of Apple, Google and Amazon. You would be hard-pressed to find someone who doesn’t use some sort of Google product, whether it be Search, Maps, Gmail or Drive.
Unfortunately, most of us have, at some level or another, allowed tech giants the systemic right to strip-search the data in our lives in exchange for the “free” services they offer – famously termed ‘surveillance capitalism’ by Shoshana Zuboff in her bestselling book, The Age of Surveillance Capitalism.
But perhaps the debate has reignited because this time it is the government that wants the data, rather than a private company providing a commercial service.
What people may not know, however, is that, in a roundabout way, most data ends up in the same place. When you think about it, there are very few large cloud players: mainly Google, Amazon Web Services (the platform the COVID-19 tracker will use) and Microsoft, through its Azure platform. So, no matter who actually owns the data, it sits on one of a handful of platforms.
When it comes to harnessing the power of big data, what is it that we need to consider to ensure it’s being used and collected ethically, while still delivering the desired outcome?
Capturing the data
The first point to consider is how the data is captured, as the identifiability of information often depends on the type of information being recorded. Data collected “by stealth” is rarely put to ethical use. Many pieces of data or metadata can be recorded to provide analytics: IP addresses, the numerical addresses allocated to your device on a network; MAC addresses, the unique codes assigned to each internet-connected device; the plugins you have installed on your browser; third-party cookies, which track your clicks from site to site; or geographical location. Some companies make their money by aggregating and selling this data to build a “digital fingerprint”, which can be used to identify you.
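To make this concrete, here is a minimal sketch, in Python with invented field names and values, of how individually weak signals can be combined into a single fingerprint:

```python
# A minimal sketch of digital fingerprinting: individually weak signals are
# combined and hashed into one stable ID. All fields and values are invented.
import hashlib
import json

def fingerprint(attributes: dict) -> str:
    """Hash a set of device/browser attributes into a single stable ID."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

visitor = {
    "ip": "203.0.113.7",                    # address allocated on the network
    "mac": "3c:22:fb:aa:bb:cc",             # hardware address (hypothetical)
    "plugins": ["pdf-viewer", "widevine"],  # installed browser plugins
    "timezone": "Australia/Melbourne",
}

# The same attributes always hash to the same ID, so the visitor can be
# recognised across sites even without a name or login.
print(fingerprint(visitor))
```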
These sets of data points can’t necessarily identify a person on their own, but they can uncloak an identity when overlaid with a second dataset. For example, if a mobile phone provider were to issue an iPhone to a user, that iPhone would have a unique MAC address that the telco would know was issued to that user. Someone in the provider’s data analytics team would see that MAC address appear on their system when the device was turned on, but they would not be able to identify the individual user. However, if they were to cross-reference the MAC address with the list of phones on the network and with the sales database, they could pinpoint that particular iPhone to an individual.
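A toy version of that cross-reference, with entirely invented records, shows how little it takes: neither dataset names the user on its own, but joining them on the shared MAC address does.

```python
# Re-identification by overlaying two datasets (all records are invented).
network_log = [  # what the analytics team sees: device activity, no names
    {"mac": "3c:22:fb:aa:bb:cc", "event": "device joined network"},
]
sales_db = [     # what the sales system knows: who bought which handset
    {"mac": "3c:22:fb:aa:bb:cc", "customer": "J. Citizen"},
]

# Joining on the shared MAC address uncloaks the identity behind the event.
customers = {row["mac"]: row["customer"] for row in sales_db}
for event in network_log:
    print(event["event"], "->", customers.get(event["mac"], "unknown"))
```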
Another example could be a driver in a GPS-connected car travelling from one location to another. If you know who lives at the destination address, you may be able to uncloak the driver’s identity. But without that additional identity information, it’s just an unknown person travelling from point A to point B.
All this data can be captured and accessed by anyone with the right know-how, but additional measures can be put in place to shore up data privacy and security. One way to ensure privacy is simply not to capture all of the metadata. For example, when tracing a phone handset, not recording the MAC address means no one can ascertain exactly which device accessed a network, making it harder to identify the device’s user. For the road user, it may mean excluding some information around the point A location (as ‘social fitness network’ Strava does with its “Zones of Privacy”). The individual then can’t be identified, because no one knows precisely where they came from – just that they came from a general area and travelled to destination X.
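A sketch of that idea: snap the recorded start point to a coarse grid cell rather than storing the exact coordinate. The two-decimal-place grid, roughly a kilometre across, is an assumption for illustration, not Strava’s actual zone size.

```python
# A sketch of a "zone of privacy": store only a coarse grid cell around the
# start of a trip. The grid size here is an assumption, not Strava's.

def generalise(lat: float, lon: float, places: int = 2) -> tuple:
    """Snap a precise coordinate to a coarse grid cell (~1 km at 2 places)."""
    return (round(lat, places), round(lon, places))

exact_start = (-37.814215, 144.963175)  # precise (hypothetical) start point
print(generalise(*exact_start))         # (-37.81, 144.96): a general area only
```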
While providing a reasonable degree of identity protection, the downside of not recording such data is that it makes it much more challenging to realise the benefits of that information, which can include additional analytics or services enabled by that specific data. For example, if we don’t record the MAC address, we don’t know whether a device logged onto a network later is the same device, or whether it is an entirely different one. Likewise, if we don’t have a specific location for when a driver clocks on, we can’t predict traffic patterns as accurately, or we may not be able to identify a car moving only a small distance around a car park.
However, there is a reasonably good solution to the conundrum of realising data benefits while protecting individual identity, and this is where alias IDs come in. In IT circles, the approach is called “pseudo-anonymity” (the GDPR’s term is pseudonymisation).
Removing unique identifiers
Alias IDs bring a secondary unique identifier into play. This secondary identifier masks the primary identifier – the one most likely to be traced back to the individual – by giving it another form. A good way to think about this is the customer or member number an organisation gives you to identify your account, used ahead of more easily identifiable information like your name, date of birth or home address. But it is still easy to look up your member number and quote it to a call centre operator. So instead of using that member identifier directly, a second, very hard to guess identifier is issued, and all IT systems refer to your data using this other identifier.
Your member number on its own means nothing unless it is matched to your name, but it can still be used to look up your health care number, Medicare number, credit card number or insurance policy. What we want is an identifier that does not publicly expose these IDs to the world.
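In code, issuing such an identifier can be as simple as the sketch below, where an in-memory mapping table stands in for a protected internal database and all names are illustrative:

```python
# A sketch of alias IDs: systems pass around a hard-to-guess secondary
# identifier, and only a protected mapping links it back to the member number.
import secrets

alias_to_member = {}  # stands in for a protected internal database

def issue_alias(member_number: str) -> str:
    """Issue a hard-to-guess secondary identifier for a member number."""
    alias = secrets.token_urlsafe(16)  # 128 bits of randomness
    alias_to_member[alias] = member_number
    return alias

alias = issue_alias("member-10042")
print(alias)                   # what other IT systems see and pass around
print(alias_to_member[alias])  # resolvable only via the protected mapping
```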
A GUID (Globally Unique Identifier) generator can produce enormous numbers of unique identifiers with a negligible risk of duplicates, and free online GUID generators are easy to find.
A random (version-4) GUID contains 122 random bits, giving around 5.3 × 10³⁶ possible values. You would need to generate roughly 2.7 quintillion GUIDs before there was even a one-in-two chance of a single duplicate – so if everyone on Earth generated a GUID at the same moment, the odds of any collision at all would be vanishingly small.
An example of a GUID looks like this: 3587050c-012b-42a0-bbc0-2285cf37c449.
You’ve probably seen these from time to time. Next time you see one, think about what that is – and the unique thing it identifies.
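Generating one takes a single call in most languages; in Python, for example, the standard library does it:

```python
import uuid

# Each call draws 122 random bits, so duplicates are vanishingly unlikely.
print(uuid.uuid4())  # e.g. 3587050c-012b-42a0-bbc0-2285cf37c449
print(uuid.uuid4())  # an unrelated value every time
```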
There are many practical uses for GUIDs – one example could be a food delivery system. A food delivery company makes a record in its database for your order (order 534, for example) and wants to retain this information in its accounting system forever. At the same time, it wants to let the restaurant, the delivery driver and the customer see the information – but it doesn’t want to give the same database ID to multiple people, as that could let a hacker identify an individual easily.
So, it issues two GUIDs to track the order – one to the restaurant (e.g. 86875904-4e69-4610-8eb9-14dffcdd839a) and one to the delivery driver and customer (e.g. 2ebc325b-4a2c-4578-b345-759330052441). The restaurant gets one unique link (e.g. http://orders.mydeliveryservice.com/86875904-4e69-4610-8eb9-14dffcdd839a), and the customer waiting for the delivery gets another (e.g. http://deliverytracker.mydeliveryservice.com/2ebc325b-4a2c-4578-b345-759330052441) to track the order. Both GUIDs, however, are just temporary alias IDs for the original order 534, lasting just long enough for the order to complete. Once the order is fulfilled and delivered, the restaurant and customer are notified and the GUIDs deleted.
Now, if the restaurant, the delivery driver or the customer tries to access the order the next day using those same GUIDs, it is impossible: the GUIDs have been permanently removed, and no one can use them. Yet the food delivery company still has order 534 stored in its accounting system. With the public identifiers removed, it becomes extremely difficult to hack into the system and harvest sensitive information that can be traced to an individual. As an extra level of protection, the restaurant isn’t given the same identifier as the customer – so if the restaurant’s system is hacked, customers are still protected.
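A minimal sketch of that whole flow, with hypothetical names and in-memory dictionaries standing in for the real databases, might look like this:

```python
# Sketch of the order flow: one permanent internal order ID, two throwaway
# GUID aliases, and both aliases deleted once the order completes.
import uuid

orders = {534: {"items": ["pad thai"], "status": "preparing"}}
aliases = {}  # GUID -> internal order ID

def issue_alias(order_id: int) -> str:
    guid = str(uuid.uuid4())
    aliases[guid] = order_id
    return guid

restaurant_guid = issue_alias(534)  # embedded in the restaurant's link
customer_guid = issue_alias(534)    # embedded in the customer's tracking link

def lookup(guid: str):
    order_id = aliases.get(guid)
    return orders[order_id] if order_id is not None else None

def complete_order(order_id: int) -> None:
    """Mark the order delivered and delete every alias pointing at it."""
    orders[order_id]["status"] = "delivered"
    for guid in [g for g, oid in aliases.items() if oid == order_id]:
        del aliases[guid]

print(lookup(customer_guid))  # works while the order is live
complete_order(534)
print(lookup(customer_guid))  # None: the alias is gone for good
print(orders[534])            # the internal accounting record remains
```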
Compliance with privacy laws
The COVIDSafe app operates similarly. It collects people’s phone numbers and stores them in a backend system.
There is an extra level of protection here, though: through the app, the government has access only to your phone number, and no other information that can be traced to the individual – not even a first name. If the government wants to attribute a phone number in the app’s back end to an individual, it needs to obtain the number, then a court order and a warrant compelling the phone provider to reveal your identity. This is where the power of pseudo-anonymity lies: all the necessary information is collected, but only those authorised can access it and trace it to individuals.
This approach is essential for complying with EU privacy law. The General Data Protection Regulation (GDPR) stipulates that any user may request the data a company has captured on them and, by extension, has the right to request that this data be deleted. For many organisations this can be an administrative nightmare, as they not only have to remove the data from their own systems but often from third-party providers’ or vendors’ systems as well. But if the data is de-identified through secondary unique identifiers, it is not necessarily attributable to that person, which can save organisations significant time on administrative tasks. Pseudo-anonymity is excellent and makes it easier for developers to keep your information private – but it’s not perfect.
Encrypted data transmission
The other factor that must be taken into account is how data is transmitted. There is no point implementing de-identifying regimes if the data in transit is not secure. What many people don’t realise is that if a connection is not encrypted, any information transmitted between two devices can be intercepted. Free Wi-Fi hotspots are the biggest risk, as little or no encryption exists on these networks. Mobile phone networks are slightly safer, but risks still exist. Then, on top of that, there is the protocol the apps themselves use to communicate: is it encrypted or unencrypted?
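As one illustration of encrypting at the application layer – a sketch using the third-party cryptography package, with key handling simplified and the payload invented – encrypting the message itself means a snooper on an open hotspot captures only ciphertext:

```python
# A sketch of encrypting a payload before transmission, using the third-party
# "cryptography" package (pip install cryptography). Key exchange is glossed
# over: in practice the key is negotiated or pre-shared securely.
from cryptography.fernet import Fernet

key = Fernet.generate_key()
channel = Fernet(key)

message = b"phone=+61400000000"     # hypothetical app payload
ciphertext = channel.encrypt(message)

print(ciphertext)                   # all an eavesdropper would capture
print(channel.decrypt(ciphertext))  # readable only with the key
```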
The growing proliferation of 5G will only make encryption more crucial, as private networks give way to direct connections and the network ‘core’ moves from a physical place to a virtual one. Just look at what has happened already: fewer and fewer businesses run their own data centres. With cars, fridges, keys and even pet collars becoming digital, everything is connecting via a public mesh of SIM-equipped devices rather than through a traditional home router.
With technology evolving rapidly, these concepts must be dissected, debated and implemented to ensure community trust in data-driven services. If we get this right, we can harness the true power of big data, delivering significant benefits to society and improving our daily lives.
Article written by Jason Lowder, a former Intelematics employee and contributor to the Intelematics Thought Leaders Club.