Aya’s lecture on privacy

You should already know who I am. In case you don’t, I am Aya Shameimaru, reporter, writer, editor, and publisher of Bunbunmaru, Gensokyo’s leading newspaper.

Today’s lecture is about privacy issues, which I run into on a daily basis as a journalist.

Have you ever used an online social networking application like Facebook? When you use Facebook to share information about yourself with your online friends, whoever’s in charge of the application gets to see that information too. Facebook makes money by selling information about you to advertisers.

Facebook and similar sites are not the only places that collect and distribute your data. Erin’s Clinic publishes clinical studies and discharge databases, often less motivated by profit and more motivated by helping research and treatment. Regardless of who is collecting your data, they’ll eventually have a lot of it. Your Scarlet Mansion Library borrower’s history can have tens of thousands of attributes, and your Kourindou purchase history can have millions.

People are unique, and the data collected about them is also unique. The statement “No other profile is more than 30% similar to mine” is true for 90% of the records in the database of Netflix, a famous internet-based video rental service from the outside world.

This has pretty serious implications for protecting your privacy.

That uniqueness makes it easy to identify a person from the given data. The number of data points required to single someone out grows roughly with the log (logarithm) of the number of users of the system. For example, Netflix has 500,000 users and 100 million ratings. If someone finds four movie ratings and knows that they were all made by the same person, that someone can, on average, uniquely identify the person who made those ratings.
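For the numerically inclined, here is a rough sketch of that logarithm at work. It assumes, purely for illustration, that every known data point shrinks the pool of candidates by the same factor; `shrink_factor` and the values passed to it are my own made-up parameters, not anything measured from Netflix.

```python
import math

def bits_to_identify(num_users: int) -> float:
    """Bits of information needed to single out one user among num_users."""
    return math.log2(num_users)

def points_needed(num_users: int, shrink_factor: float) -> int:
    """Data points needed if each one divides the candidate pool by shrink_factor.

    shrink_factor is a hypothetical parameter used for illustration only.
    """
    return math.ceil(math.log(num_users) / math.log(shrink_factor))

netflix_users = 500_000  # user count quoted in the lecture

print(round(bits_to_identify(netflix_users), 1))  # bits needed to pick one user
print(points_needed(netflix_users, 2))   # each point halves the pool
print(points_needed(netflix_users, 30))  # each point cuts the pool 30-fold (assumed)
```

If each rating really did cut the pool thirtyfold, four ratings would be enough for 500,000 users, since 30 to the fourth power is 810,000. The factor itself is a guess chosen to show the shape of the argument.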

I say “on average” because some data points help more than others in uniquely identifying a person. For example, if you knew that a certain resident of Gensokyo was female, appeared to be less than 20 years old in human years, had the power to fly, and preferred frilly outfits, that wouldn’t help you much. But if you found out that the mystery person was male and nearsighted, it would narrow your search down to a single person with only two data points.
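A tiny sketch of why rare attributes are worth more than common ones. The population below is entirely made up (five placeholder residents with hypothetical attribute names), so treat it as a toy and not as a census of Gensokyo.

```python
# Toy population: placeholder residents with hypothetical attributes.
residents = [
    {"id": "resident_1", "gender": "female", "can_fly": True, "nearsighted": False},
    {"id": "resident_2", "gender": "female", "can_fly": True, "nearsighted": False},
    {"id": "resident_3", "gender": "female", "can_fly": True, "nearsighted": False},
    {"id": "resident_4", "gender": "female", "can_fly": False, "nearsighted": False},
    {"id": "resident_5", "gender": "male", "can_fly": False, "nearsighted": True},
]

def candidates(population, **known):
    """Return every resident consistent with all the known attribute values."""
    return [p for p in population if all(p[k] == v for k, v in known.items())]

# Two common attributes leave several suspects.
common = candidates(residents, gender="female", can_fly=True)
# Two rare attributes leave exactly one.
rare = candidates(residents, gender="male", nearsighted=True)

print(len(common), len(rare))
```

Both queries use two data points, but only the second one pins down a single resident.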

Of course, the entities who collect your data can’t sell just any information about you without getting in trouble with the law, so before they sell it they hide bits of the information to meet certain standards of privacy. This is called “anonymization”.

Take for instance this graph of social data. It consists of nodes that represent people in a society, and edges that join two people depending on the relationship between them.

Bunbunmaru News recently acquired (through legitimate means, I assure you) a massive social data graph of all the people in Gensokyo.

Just so you know, Bunbunmaru News is bound by ethical reporting standards, and will never compromise people’s privacy in order to sell newspapers through the appeal of scandal. So I will only publish a section of the graph, and I will be sure to hide the names and censor the pictures.

[Image: a section of the social graph, with names hidden and pictures censored]

Who’s that person in the middle being targeted by romantic desire arrows from no fewer than six people? Think of all the wonderful chaos that would result if the identity of the people in this graph became public knowledge. It’s almost tempting, but I, Aya Shameimaru, have a reputation of integrity to live up to.

However, it turns out that this approach to protecting people’s privacy has flaws. Take a census form from the Scarlet Mansion, which contains name, age, and gender, but no sensitive information. Then take an anonymized patient release record from Erin’s Clinic, which has age, gender, zip code, and sensitive medical information, but no names. If you join those two together on age, gender, and the zip code of the Scarlet Mansion, you would be able to tell, for example, that Patchouli Knowledge of the Scarlet Mansion suffers from asthma.
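Here is a minimal sketch of that kind of join, assuming both data sets are simple lists of records. The ages, zip codes, placeholder residents, and `condition_x` are invented for illustration; only the asthma example comes from the lecture.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("age", "gender", "zip")

# Census: names, but nothing sensitive. All values are made up.
census = [
    {"name": "Patchouli Knowledge", "age": 100, "gender": "F", "zip": "00001"},
    {"name": "resident_2", "age": 16, "gender": "F", "zip": "00001"},
    {"name": "resident_3", "age": 16, "gender": "F", "zip": "00002"},
]

# Clinic release: sensitive data, but no names. All values are made up.
clinic = [
    {"age": 100, "gender": "F", "zip": "00001", "diagnosis": "asthma"},
    {"age": 16, "gender": "F", "zip": "00002", "diagnosis": "condition_x"},
]

def link_records(named, anonymized, keys=QUASI_IDENTIFIERS):
    """Join two data sets on shared attributes and report unambiguous matches."""
    index = defaultdict(list)
    for row in anonymized:
        index[tuple(row[k] for k in keys)].append(row)

    linked = {}
    for person in named:
        matches = index.get(tuple(person[k] for k in keys), [])
        if len(matches) == 1:  # exactly one candidate: the name is re-attached
            linked[person["name"]] = matches[0]["diagnosis"]
    return linked

print(link_records(census, clinic))
```

Neither data set is dangerous alone. The shared columns are what let a name from one be attached to a diagnosis from the other.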

To better protect people’s privacy, standards have been implemented to improve anonymization. Bits of information, such as name, address, and age, that can be used to uniquely identify, contact, or locate an individual are classified as “PII”, or “personally identifiable information”, and are erased from published data sets.

Anonymized data sets are made to be “k-anonymous” (k being a number), where thanks to the erasure of PII, each given record is identical to at least k-1 others, so that it hides in a group of at least k. For example, if you know that the two people at the top of this picture are both in love with the mystery person, and in addition are both youkai magicians who are always carrying books and have been repeated victims of burglary, you could say that the data set (of only the two magicians) is k-anonymous with k=2. For the known data points, you have two identical records. Of course, 2 is a small number and in practice k is usually bigger.
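One way to check this on a small table is sketched here. It takes k to be the size of the smallest group of records that share the same known attributes, which matches the two-magician example. The column names are hypothetical.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records sharing the same
    values for the given attributes (0 for an empty data set)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

KEYS = ("species", "carries_books", "burglary_victim")  # hypothetical columns

# The two magicians from the example: identical on every known data point.
magicians = [
    {"species": "youkai magician", "carries_books": True, "burglary_victim": True},
    {"species": "youkai magician", "carries_books": True, "burglary_victim": True},
]

print(k_anonymity(magicians, KEYS))

# One record that matches nobody else drags the whole data set down.
with_outlier = magicians + [
    {"species": "human", "carries_books": False, "burglary_victim": False},
]
print(k_anonymity(with_outlier, KEYS))
```

The second call shows why k is a property of the whole data set: a single unique record is enough to make it only 1-anonymous.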

Interestingly, the graph of people who have been robbed correlates very strongly with the love graph centered around this mysterious witch-hatted woman. Coincidence?

Unfortunately, anonymity does not equal privacy. If you can successfully join from one data set to another and end up with multiple identical records, you gain information if there is even one new column in the result.

For example, if you join from a data set of names to a data set of people with a certain illness, even if you get multiple rows joining to the row of the person you’re interested in, the success of the join lets you know that the person has that illness.
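To make that concrete, here is a small sketch in which the join returns several rows for the target and still gives something away. The records, column names, and `condition_y` are made up for illustration.

```python
def leaked_attributes(target, released, keys):
    """Return the new columns whose value is the same in every released row
    matching the target, i.e. what the join reveals despite the ambiguity."""
    matches = [r for r in released if all(r[k] == target[k] for k in keys)]
    if not matches:
        return {}
    leaked = {}
    for column in matches[0]:
        if column in keys:
            continue
        values = {r[column] for r in matches}
        if len(values) == 1:  # every candidate row agrees, so the value is known
            leaked[column] = values.pop()
    return leaked

KEYS = ("age", "gender")  # hypothetical join columns

released = [
    {"age": 100, "gender": "F", "diagnosis": "asthma"},
    {"age": 100, "gender": "F", "diagnosis": "asthma"},
    {"age": 16, "gender": "F", "diagnosis": "asthma"},
    {"age": 16, "gender": "F", "diagnosis": "condition_y"},
]

person_a = {"name": "resident_a", "age": 100, "gender": "F"}
person_b = {"name": "resident_b", "age": 16, "gender": "F"}

print(leaked_attributes(person_a, released, KEYS))
print(leaked_attributes(person_b, released, KEYS))
```

The first person matches two rows, so the data is 2-anonymous for her, yet both rows carry the same diagnosis and the secret is out anyway. The second person is only protected because her two candidate rows disagree.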

Even our anonymized love graph has problems in theory. Although the images are censored, they show enough that if the characters wear their costumes and hairstyles consistently, and someone has trained enough to recognize anyone in Gensokyo by their costume, the identity of everyone in this graph would be instantly revealed to that someone. Even if there were no images, someone familiar enough with the residents of Gensokyo would be able to narrow down the character in the center of the graph to a handful of girls who are extremely popular with the other girls.

[Image: the anonymized love graph]

Of course, I don’t think there is anyone with that level of knowledge besides myself, so the secrets of this graph should remain safe barring accidental leaks.

Aya Shameimaru
Bunbunmaru Newspaper

(Information taken from this lecture.)

8 Responses to “Aya’s lecture on privacy”

  1. Samukun says:

    You should seriously compile all of these after you’ve finished grad school and consider publishing your very own edition of “IT Management for Touhou Fans.”

  2. Khôi says:

    I love loves Mari… ehh Mollisa too!

    I took a CS course about data mining two years ago and they didn’t mention those misconceptions about anonymizing. I always learn something new reading your blog 😉

  3. Shance says:

    You, sir, has just made Computer Engineering courses a lot more fun.

  4. Kurogane says:

    I like the relationship graph the most.

  5. Zeroblade says:

    This is absolute LOL but informative at the same time!

  6. Dio says:

    Aww I was hoping your name was actually Aya Shameimaru on Facebook as I’m not that familiar with Touhou :< Have you played MegaMari or Super Marisa World/Land? Anyway I came here because of your Lotus Cobra is Evil comic, though I read it before it was featured on Wizards.com, congrats! I really enjoyed the comic and I really like your art.

  7. Rocket says:

    There is a significant business opportunity in moetic C.S. degrees. I suspect a significant intersection of target audience…
