14 September 2019

How Data Hoarding Is the New Threat to Privacy and Climate Change

Tyler Elliot Bettilyon

As machine learning and other data-intensive algorithms proliferate, more organizations are hoarding data in hopes of alchemizing it into something valuable. From spy agencies to network infrastructure providers, data collection is part and parcel of the digital economy. The best data can be combined with clever algorithms to do incredible things — but digital hoarding and computationally intensive workloads have externalities too.

The electrical costs — and therefore the environmental impacts — of computation are both extraordinary and growing. Modern machine learning (ML) models are a prime example: they require enormous amounts of energy to process mountains of data. According to OpenAI, the computational cost of training the largest ML models has been growing exponentially since 2012, with a doubling time of roughly 3.4 months. In recent months, similar studies have shown that the electrical costs of cryptocurrency and video streaming are also significant and growing.
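For a rough sense of what that doubling rate implies, here is a back-of-envelope sketch. The 3.4-month doubling time is OpenAI's published estimate; the six-year span is an assumption I'm choosing purely for illustration, not a figure from their study.

```python
# Back-of-envelope: how much training compute grows under a fixed doubling time.
# The 3.4-month doubling time is OpenAI's estimate; the six-year span is an
# illustrative assumption.
DOUBLING_TIME_MONTHS = 3.4
SPAN_MONTHS = 6 * 12

doublings = SPAN_MONTHS / DOUBLING_TIME_MONTHS
growth_factor = 2 ** doublings
print(f"{doublings:.1f} doublings in six years -> roughly {growth_factor:,.0f}x more compute")
```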

Producing this electricity creates literal exhaust in most cases — there are precious few server farms running on 100% renewable energy — and with climate change looming large, it’s time we acknowledge the environmental impact of computation. Just as wrapping every little thing in a plastic bag is frivolous and wasteful, so is some of our CPU usage.


Computer science and engineering experts have been complaining about this for years. Some point out that we went to the moon with only 4 KB of RAM. Others detail how slow and bloated modern software is. Jonathan Blow went so far as to warn about the impending collapse of the entire software engineering discipline due to intergenerational knowledge loss.

Most of the time this argument is positioned in terms of engineering elitism. Its supporters nostalgically harken back to a time when it really meant something to be a software engineer. They scold beginners for not knowing better while flaunting their beautiful hair, tinged with the silvery gray of experience. Despite the condescension, they’re not completely wrong.

As computers got faster and faster, computer programs actually got slower. End-users didn’t notice because the slower programs still ran fast on the faster computers. As a result, many developers rarely have to focus on using memory or CPU cycles efficiently. Our incredible CPUs can run even relatively inefficient code fast enough for most users. Tools and programming languages that prioritize the developer’s time over CPU and memory efficiency have become the norm. AWS and other cloud services epitomize this tradeoff — why spend weeks of development time optimizing the code when Amazon can just automatically turn on a few more servers when we need them?
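A toy illustration of that tradeoff (the workload here is invented for demonstration): both versions below are correct, and on a modern CPU both feel instant to a user, but one does thousands of times more work per lookup.

```python
import timeit

# Membership checks: the list version scans linearly on every lookup,
# while the set version does a constant-time hash lookup.
items_list = list(range(100_000))
items_set = set(items_list)

slow = timeit.timeit(lambda: 99_999 in items_list, number=1_000)
fast = timeit.timeit(lambda: 99_999 in items_set, number=1_000)
print(f"list scan: {slow:.4f}s   set lookup: {fast:.6f}s")
```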

“More efficient is better” just doesn’t motivate me the same way as “we should do our part to conserve electricity, since climate change is an existential threat to humanity.”

There is nothing wrong with professionals trying to hold an industry to high standards. But I do wish the pro-efficiency crowd would use a more persuasive tactic than tautological scolding. Maybe it’s just me, but “more efficient is better” just doesn’t motivate me the same way as “we should do our part to conserve electricity, since climate change is an existential threat to humanity.” It’s not just about inefficient electrical use, either. The data we generate is itself a kind of digital pollutant — a new kind of trash for the information age.

Some data is a waste product in the same way that junk mail is a waste product. How many computational resources are dedicated to the zillions of spam emails sent every day? How much bandwidth is dedicated to ads sitting unclicked in your sidebar? Increasingly, records of nearly every digital transaction — no matter how trivial — are transmitted to a data center and stored. It may seem hyperbolic to harp on a few wasted bits, but this is a serious problem.

Consider this: Loading Twitter requires about 6 MB of data.



Twitter claimed it had about 126 million daily active users in February. If every user loads the homepage just once per day, that represents 756 terabytes of information transmitted every day. Just for Twitter. Add Amazon, Facebook, Google, and all the rest, and we’re talking about enormous amounts of data occupying wires, passing through the air, and taking up CPU time. What fraction of that data actually delivers real value to the end-user? What fraction of it glides across our screens in complete irrelevance?
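The arithmetic behind that figure (and behind the savings estimate near the end of this piece) is simple enough to sketch, using only the 6 MB page weight and 126 million daily users quoted above:

```python
# Back-of-envelope traffic estimate from the figures quoted above.
PAGE_WEIGHT_MB = 6              # approximate data transferred per Twitter page load
DAILY_ACTIVE_USERS = 126e6      # Twitter's reported daily active users (Feb 2019)

daily_tb = PAGE_WEIGHT_MB * DAILY_ACTIVE_USERS / 1e6   # 1 TB = 1,000,000 MB
print(f"~{daily_tb:.0f} TB transmitted per day")                         # ~756 TB
print(f"~{daily_tb / 2:.0f} TB saved daily if page weight were halved")  # ~378 TB
```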

All this data requires infrastructure. We need more and faster cables, routers, computers, and phones. We need to upgrade from 4G to 5G. We need to build data centers and server farms. This digital waste results in an ever-increasing amount of always-on physical infrastructure. The amount of land used by server farms is staggering. These electronics are hard to recycle and wear out more quickly under heavier load. The constant cycle of upgrading and replacing them creates serious environmental and health risks, especially as discarded components pile up as e-waste. If we used this infrastructure with greater care — transmitting and storing data as efficiently as possible — we could significantly reduce our infrastructural and electrical needs.

Some of this data is parasitic in nature — it benefits some at the cost of others. Advertisers track us as we browse the internet. Browser extensions harvest our web history. Weather apps track our location. The list goes on. These different aspects of our personal histories are sold to data brokers, who repackage and resell the combined data to third, fourth, and fifth parties. To most people, this data is waste that should just be discarded. Most people will never perform a thorough audit of their internet history, but for advertisers and political strategists, it can be a goldmine. Worse, governments and corporations will continue to fall victim to hackers. These data sources will inevitably fall into the hands of malicious actors.

There is also a vast body of evidence demonstrating that the existence of large data sets — each individually innocuous — can amount to something more dangerous. By correlating information from several different sources, attackers can cobble together a clear profile and use that information to connect more sensitive pieces of data. So many “anonymized” datasets have been compromised using these tactics that some in the field are declaring “anonymization is dead.” These researchers are calling for a new paradigm that prioritizes transparency regarding data collection over attempts to anonymize the data.
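Here is a minimal sketch of how such a linkage attack works; the two tables, their column names, and every value in them are invented for illustration. Joining an “anonymized” dataset to a public one on shared quasi-identifiers can re-attach names to sensitive records.

```python
import pandas as pd

# Hypothetical "anonymized" health records: no names, but quasi-identifiers remain.
medical = pd.DataFrame({
    "zip": ["80202", "80203"],
    "birth_year": [1984, 1991],
    "sex": ["F", "M"],
    "diagnosis": ["diabetes", "asthma"],
})

# Hypothetical public voter roll: names plus the same quasi-identifiers.
voters = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],
    "zip": ["80202", "80203"],
    "birth_year": [1984, 1991],
    "sex": ["F", "M"],
})

# Joining on the shared attributes re-identifies the "anonymous" patients.
reidentified = medical.merge(voters, on=["zip", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```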

And some data — like radioactive waste, used needles, or bloody tissue — is dangerous to even have lying around. Social security numbers, credit card numbers, driver’s license information, and other highly sensitive information should only be stored when absolutely necessary, and with special precautions to keep it out of reach of malicious actors.
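One such precaution, sketched below with Python's cryptography package, is to encrypt a sensitive field before it ever touches storage and keep the key somewhere else entirely. The field name and values here are hypothetical.

```python
from cryptography.fernet import Fernet

# Illustrative only: in practice the key belongs in a secrets manager or KMS,
# never alongside the data it protects.
key = Fernet.generate_key()
fernet = Fernet(key)

ssn = "123-45-6789"                              # hypothetical sensitive value
ciphertext = fernet.encrypt(ssn.encode())        # this is what gets stored
plaintext = fernet.decrypt(ciphertext).decode()  # decrypt only when absolutely needed
assert plaintext == ssn
```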

It is worth saying that there are, of course, many engineers keenly focused on performance optimization and privacy preservation. And much more can still be done. One of the most beautiful aspects of software in the internet age is that we can deploy improvements instantly, all around the world. Updates start having an impact immediately, and cuts to processing and data requirements accrue over time.

Just like the fossil-fuel industry, many programming firms have enriched themselves with data while ignoring the externalities of their product.

Returning to our crude estimate from above: if Twitter alone cut its page weight in half, it would save 378 terabytes of data transmission per day. If every company made an effort to store only absolutely necessary data, and to secure that data from malicious actors, we would all be safer from privacy violations. Just like the fossil-fuel industry, many programming firms have enriched themselves with data while ignoring the externalities of their product.

Whether or not companies will start taking those precautions is another question entirely. Which reminds me: don’t forget to claim your $125 Equifax payout.
