There's been a lot of talk lately about psychological or psychographic profiling based on social-media data. How does it work? (And does it work?!)
It's worth noting right off the bat that companies like Cambridge Analytica, Facebook, Google, and Twitter typically talk out of both sides of their mouths. When selling their product to potential clients, all of these companies describe magical powers. When talking to critics and regulators, all of a sudden their product is infantile, limited, and hardly impactful at all.
That disconnect is important. One (or both) of those characterizations isn't true. And whether they're lying to their clients, their users, or regulators, that's a problem.
But where's the truth? (How) does it all work?
Cambridge Analytica doesn't give out the recipe for its secret sauce. No one does. But these groups don't have secret proprietary algorithms; they make unique applications of existing science. Knowing the data they have and the science that exists, we can piece things together. And given what we know about the social-media data that groups like Cambridge Analytica have access to, and what we know they're doing with it, it looks a lot like an information recommendation system.
Think Pandora (though Netflix, Facebook, Twitter, YouTube, and Amazon all do something similar). As you listen to music, Pandora collects data about your listening habits: songs listened to, skipped, paused; thumbs up, thumbs down; playlists and stations created, etc. In addition to its user-behavior data, Pandora also has specific song-feature data from its Music Genome Project: ratings for each song according to features like "fast tempo", "guitar-driven", "protest-themed lyrics", even things like "danceability".
Pandora needs to connect user-behavior data to song-feature data in order to decide what song to play next. They do this by creating user "profiles" that capture each user's preference (or lack of it) for songs containing various features (and combinations of features). These profiles are usually incomplete. I've never listened to country music on Pandora, for example, so it doesn't know my taste within that genre. However, by comparing the data they do have on me to other user profiles, they can fill in the gaps with data from similar users. Then the algorithm puts songs in my playlist that possess the features I'm most likely to like, and don't possess the features I'm prone to dislike.
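To make that concrete, here is a minimal sketch of the general idea (fill the gaps in one profile using similar users, then score songs against the filled-in profile). It is not Pandora's actual method, which isn't public. The profiles, the feature names, the song catalog, and the choice of cosine similarity with positively similar users only are all assumptions made up for illustration.

```python
from math import sqrt

# Hypothetical profiles: preference per song feature, from -1 (dislike) to 1 (like).
# A missing key means the service has no data for that user and feature.
profiles = {
    "me":    {"fast_tempo": 0.8, "guitar_driven": 0.9, "protest_lyrics": 0.4},
    "alice": {"fast_tempo": 0.7, "guitar_driven": 0.8, "protest_lyrics": 0.5, "country": 0.6},
    "bob":   {"fast_tempo": -0.6, "guitar_driven": -0.7, "protest_lyrics": -0.2, "country": -0.9},
}

# Hypothetical song catalog: how strongly each song exhibits each feature (0 to 1).
songs = {
    "country_rocker": {"country": 1.0, "guitar_driven": 0.8},
    "synth_pop": {"fast_tempo": 0.5},
}


def similarity(a, b):
    """Cosine similarity over the features both users have data for."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[f] * b[f] for f in shared)
    norm_a = sqrt(sum(a[f] ** 2 for f in shared))
    norm_b = sqrt(sum(b[f] ** 2 for f in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fill_gaps(user, profiles):
    """Estimate a user's missing preferences from positively similar users."""
    target = profiles[user]
    filled = dict(target)
    all_features = {f for p in profiles.values() for f in p}
    for feature in all_features - set(target):
        weighted_sum = 0.0
        total_weight = 0.0
        for other, prof in profiles.items():
            if other == user or feature not in prof:
                continue
            weight = similarity(target, prof)
            if weight <= 0:
                continue  # simplification: ignore dissimilar users
            weighted_sum += weight * prof[feature]
            total_weight += weight
        if total_weight > 0:
            filled[feature] = weighted_sum / total_weight
    return filled


def score(song_features, profile):
    """Higher when a song has features the user likes, lower for disliked ones."""
    return sum(w * profile.get(f, 0.0) for f, w in song_features.items())


filled = fill_gaps("me", profiles)
ranking = sorted(songs, key=lambda s: score(songs[s], filled), reverse=True)
print(filled["country"], ranking)
```

In this toy data, "alice" has tastes close to mine and "bob" has opposite tastes, so my missing country preference is borrowed from alice alone, and the country song ends up ranked first even though I have never played one.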
Targeted political content works similarly. Firms like Cambridge Analytica have access to voting records (comparable to liking/disliking/skipping songs), and by connecting to social platforms like Facebook (via apps or ads), they collect data on what we post, like, dislike, who we're friends with, etc. And thanks to the ubiquitous Like button across the web, they can also track what users do on other sites. This data (and any other data they can get their hands on) allows them to build data profiles for each of us, and they can use those profiles to find the features most strongly associated with different types of voters.
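As a rough sketch of what "features most strongly associated with different types of voters" could mean in practice, the snippet below computes a simple lift score: how much more common a behavior is within one voter type than across everyone. The records, labels, and behavior names are invented for illustration, and lift is just one plausible measure, not a claim about what any particular firm uses.

```python
from collections import Counter

# Made-up toy records: (voter type, set of observed online behaviors).
records = [
    ("regular_voter", {"likes_local_news", "follows_candidate"}),
    ("regular_voter", {"likes_local_news", "shares_petitions"}),
    ("regular_voter", {"follows_candidate", "likes_local_news"}),
    ("sometimes_voter", {"likes_memes", "follows_candidate"}),
    ("sometimes_voter", {"likes_memes"}),
    ("sometimes_voter", {"likes_memes", "shares_petitions"}),
]


def feature_lift(records, voter_type):
    """Return {feature: P(feature | voter_type) / P(feature)}.

    A lift above 1 means the feature is over-represented in that group.
    """
    overall = Counter()
    in_group = Counter()
    group_size = 0
    for label, features in records:
        overall.update(features)
        if label == voter_type:
            in_group.update(features)
            group_size += 1
    if group_size == 0:
        return {}
    total = len(records)
    return {
        f: (in_group[f] / group_size) / (overall[f] / total)
        for f in overall
    }


lift = feature_lift(records, "sometimes_voter")
top_feature = max(lift, key=lift.get)
print(top_feature, lift[top_feature])
```

With real data the same idea would need far more records and some guard against chance associations, but the principle is the same: the behaviors with the highest lift are the ones a targeter would look at first.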
In addition to classifying us as Trump voters, Clinton voters, and the like, they can profile the "lifelong Republican", the "sometimes voter", the "fair-weather fan", the "swing voter", etc. With enough data over a long enough period of time, they can also identify the information most likely to motivate a "sometimes Democrat" to vote for Clinton, or a "tentative Democrat" to stay home.
As a data scientist, I'm skeptical of the most sweeping claims from these companies' executives and marketing teams, but I'm also wary of dismissing the technology. Marketers have decades of experience using information and psychology to influence behavior, and online advertisers have access to more usable data and more targeting tools than ever before. Facebook, in particular, is a massive source of user data, and a highly influential bottleneck of information. Mining its data and using that data to game Facebook's information delivery algorithm is a powerful combination.
Free access to information is vital to the life of a democratic republic. Funneling the bulk of our information through a single, gameable (and already gamed) algorithm threatens that democratic process. As both a data scientist and a citizen, I'm concerned about that far beyond the misdeeds of any one company.