k-匿名性

k-匿名性（英語：k-anonymity）是匿名化数据的一种性质。如果一组公开的数据中，任何一个人的信息都不能和其他至少 $k-1$ 人区分开，则称该数据满足k-匿名性。k-匿名性的概念是由拉坦亚·斯威尼（英语：Latanya Arvette Sweeney）和皮兰格拉·萨马拉蒂（英语：Pierangela Samarati）在1998年的一篇论文^[1]中最先提出的，其目的是为了解决如下问题：“给定一组结构化的具体到个人的数据，能否给出一组经过处理的数据，使我们可以证明数据中涉及的个人不能被再识别（英语：Data re-identification），同时还要保证数据仍具有使用价值。”^[2]^[3]^[4]使一组数据满足k-匿名性的过程称为k-匿名化（英語：k-anonymization）。

2018年，英国计算机科学家朱纳德·阿里（英语：Junade Ali）使用k-匿名性及加密散列函数创建了一个通讯协议，可以供人匿名地验证密码是否已经泄露、但又不公开所涉及的密码；k-匿名性因此得到了媒体的广泛报道。^[5]^[6]这一协议作为一个公用API部署在了托里·亨特（英语：Tory Hunt）创立的Have I Been Pwned?（英语：Have I Been Pwned?）服务中，且被包括一些密码管理器^[7]^[8] 和浏览器扩展^[9]^[10]在内的程序广泛使用。随后，谷歌的密码检查功能也使用了这一方法。^[11]^[12]^[13]

k-匿名化的方法

在k-匿名化问题中，一个数据库是指一个n行m列的表。表格的每一行表示一条记录，对应一组对象中的一个。不同行中的记录可以相同。每列中的值代表对象的一个属性。下表是一个未经匿名化操作的数据库，其中包含一些虚构医疗数据。

姓名	年龄	性别	居住地	宗教信仰	疾病
丁一	30	女	北京	佛教	癌症
胡二	24	女	上海	佛教	病毒性疾病
张三	28	女	北京	伊斯兰教	结核病
李四	27	男	广东	不信教	无疾病
王五	24	女	上海	基督教	心血管疾病
赵六	23	男	广东	道教	结核病
孙七	19	男	上海	佛教	癌症
周八	29	男	广东	佛教	心血管疾病
吴九	17	男	上海	基督教	心血管疾病
郑十	19	男	上海	基督教	病毒感染

这组数据中有6个属性、10条记录。对给定的k，实现k-匿名性有两个常见的方法。

数据抑制：此种方法将一些属性的值用星号“*”取代。可以取代一列中的所有值或部分值。在下面的匿名化表格中，我们将“姓名”一栏的所有值、“宗教”一栏的部分值用“*”取代。
数据泛化：此种方法将一些属性的精确值用更宽泛的类别取代。例如，“年龄”一栏中的“19”可以改写为“≤20”，“23”可以改写为“20<年龄≤30”，等等。

下表经过了匿名化处理。

姓名	年龄	性别	居住地	宗教信仰	疾病
标识符	准标识符				目标属性
*	20<年龄≤30	女	北京	*	癌症
*	20<年龄≤30	女	上海	*	病毒感染
*	20<年龄≤30	女	北京	*	结核病
*	20<年龄≤30	男	广东	*	无疾病
*	20<年龄≤30	女	上海	*	心血管疾病
*	20<年龄≤30	男	广东	*	结核病
*	年龄≤20	男	上海	*	癌症
*	20<年龄≤30	男	广东	*	心血管疾病
*	年龄≤20	男	上海	*	心血管疾病
*	年龄≤20	男	上海	*	病毒感染

对敌手（英语：Adversary (cryptography)）而言，“年龄”、“性别”和“居住地”虽然单独不能用于唯一识别一个个体，但结合起来则可能用于识别唯一个体的属性被称为准标识符（英语：quasi-identifier）；相应地，“姓名”、“身份证号”等可以唯一识别一个个体的属性被称为标识符（即ID）。“疾病”、“收入”、“性取向”或其它当事人希望保护的属性常被称为“敏感属性”，也可能成为敌手的“目标属性”。这组匿名化后的数据对于“年龄”、“性别”和“居住地”三个属性具有2-匿名性，因为在这组数据中，任意一行在这三列上的值的组合都至少出现了2次。在k-匿名的数据库中，所有由准标识符组成的多元组都至少出现k次。^[14]

Meyerson和Williams^[15]的研究表明，求最优的k-匿名化方案是一个NP困难的问题；然而，利用诸如k-优化^[16]的启发式方法通常也可以得到令人满意的结果。Kenig和Tassa则提出一个求解k-匿名化问题的 $O(\log k)$ 近似算法。^[17]

可能的攻击

尽管k-匿名化是一个定义简洁且具有很多可行算法的手段，可以较好地解决一组数据的匿名化问题，但从其它角度仍然可以攻击满足k-匿名性的数据。若攻击者掌握并利用其它背景知识，这些攻击甚至可以更有效率。这些攻击包括：

同质性攻击（英語：Homogeneous attack）：如果目标属性（攻击者希望获知的属性）在k个条目中的取值都是相同的，则可以进行此种攻击。这时，就算一组数据已经被k-匿名化，目标属性在k条记录中的取值仍然可以被获取。
背景知识攻击（英語：Background knowledge attack）：此种攻击可以利用目标属性与准标识符属性之间的联系来减少目标属性里可能值的数量。例如，Machanavajjhala等人的研究^[18]表明，利用心脏病在日本人中的发病率较低这一事实，可以在医疗数据库中缩小一个敏感属性的取值范围。

负面影响

由于k-匿名化过程中不包含任何随机化的因素，攻击者可以利用这一情况来探知关于个体的信息。例如在上面的例子中，如果有人已经知道来自上海、19岁的郑十的信息包含在上面的数据库中，则可以可靠地推断他得了癌症、心血管疾病、或病毒感染中的一种。

k-匿名化方法不适用于高维（即具有很多属性）数据库的匿名化。^[19] 例如，有研究^[20]表明，如果给定4个地址，移动电话的时间戳-地点数据库單一性（英语：Unicity (computer science)）（ ${\mathcal {E}}_{4}$ ，取 $k=1$ 的k-匿名性）可能高达95%。

也有研究^[21]表明，如果k-匿名化会不相称地抑制或泛化不具代表性的属性，则该过程可能会导致数据库偏斜。但k-匿名化所使用的抑制或泛化算法也可以改进，来避免导致数据偏斜的发生。^[22]

基于散列的k-匿名化

Junade Ali提出了基于散列的k-匿名化方法；这种方法最早是为了进行密码泄露检查（英语：Compromised Credential Checking）^[23]^[5]^[6]，后来也用于MAC地址的实时匿名化^[24]。

这种方法对一个维度（属性）的数据进行密码散列化，并截取散列码来使散列冲突至少发生 $k-1$ 次。这个方法可以实现对大数据库（例如密码泄露数据库）进行的高效率匿名化检索。^[25]这种方法还可以将匿名化程度量化，以便使用者在信息泄露程度和数据的可使用程度之间取舍。^[24]^[26]

参见

参考资料

^ Samarati, Pierangela; Sweeney, Latanya. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression (PDF). Harvard Data Privacy Lab. 1998 [2017-04-12]. （原始内容存档 (PDF)于2021-03-11）.
^ P. Samarati. Protecting Respondents' Identities in Microdata Release （页面存档备份，存于互联网档案馆）. IEEE Transactions on Knowledge and Data Engineering archive Volume 13 Issue 6, November 2001.
^ L. Sweeney. Database Security: k-anonymity. [2014-01-19]. （原始内容存档于2014-06-16）.
^ L. Sweeney. k-anonymity: a model for protecting privacy （页面存档备份，存于互联网档案馆）. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 卌, 2002; 557-570.
^ ^5.0 ^5.1 Find out if your password has been pwned—without sending it to a server. Ars Technica. [2018-05-24]. （原始内容存档于2021-06-13）（美国英语）.
^ ^6.0 ^6.1 1Password bolts on a 'pwned password' check – TechCrunch. techcrunch.com. [2018-05-24]. （原始内容存档于2021-01-17）（美国英语）.
^ 1Password Integrates With 'Pwned Passwords' to Check if Your Passwords Have Been Leaked Online. [2018-05-24]. （原始内容存档于2021-03-20）（英语）.
^ Conger, Kate. 1Password Helps You Find Out if Your Password Is Pwned. Gizmodo. [2018-05-24]. （原始内容存档于2021-07-08）（美国英语）.
^ Condon, Stephanie. Okta offers free multi-factor authentication with new product, One App. ZDNet. [2018-05-24]. （原始内容存档于2021-03-11）（英语）.
^ Coren, Michael J. The world's biggest database of hacked passwords is now a Chrome extension that checks yours automatically. Quartz. [2018-05-24]. （原始内容存档于2021-02-17）（美国英语）.
^ Wagenseil I, Paul. Google's New Chrome Extension Finds Your Hacked Passwords. www.laptopmag.com. [2021-08-11]. （原始内容存档于2021-06-27）.
^ Google Launches Password Checkup Extension to Alert Users of Data Breaches. BleepingComputer. [2021-08-11]. （原始内容存档于2021-04-29）（美国英语）.
^ Dsouza, Melisha. Google's new Chrome extension 'Password CheckUp' checks if your username or password has been exposed to a third party breach. Packt Hub. 2019-02-06 [2021-08-11]. （原始内容存档于2021-05-07）.
^ Narayanan, Arvind; Shmatikov, Vitaly. Robust De-anonymization of Large Sparse Datasets (PDF). [2021-08-11]. （原始内容存档 (PDF)于2021-01-26）.
^ Roberto J. Bayardo; Rakesh Agrawal. Data Privacy through Optimal k-anonymization (PDF). 2005: 217–28 [2021-08-11]. ISBN 978-0-7695-2285-2. ISSN 1084-4627. S2CID 17044848. doi:10.1109/ICDE.2005.42. （原始内容存档 (PDF)于2020-11-23）. Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k - 1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimal k-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a nontrivial dataset under a general model of the problem. |journal=被忽略 (帮助)
^ Adam Meyerson; Ryan Williams. On the Complexity of Optimal K-Anonymity (PDF). New York, NY: ACM. 2004: 223–8 [2021-08-11]. ISBN 978-1581138580. S2CID 6798963. doi:10.1145/1055558.1055591. （原始内容存档 (PDF)于2014-05-28）. The technique of k-anonymization has been proposed in the literature as an alternative way to release public information, while ensuring both data privacy and data integrity. We prove that two general versions of optimal k-anonymization of relations are NP-hard, including the suppression version which amounts to choosing a minimum number of entries to delete from the relation. We also present a polynomial time algorithm for optimal k-anonymity that achieves an approximation ratio independent of the size of the database, when k is constant. In particular, it is a O(k log k)-approximation where the constant in the big-O is no more than 4. However, the runtime of the algorithm is exponential in k. A slightly more clever algorithm removes this condition, but is a O(k logm)-approximation, where m is the degree of the relation. We believe this algorithm could potentially be quite fast in practice. |journal=被忽略 (帮助)
^ Kenig, Batya; Tassa, Tamir. A practical approximation algorithm for optimal k-anonymity. Data Mining and Knowledge Discovery. 2012, 25: 134–168. S2CID 14158546. doi:10.1007/s10618-011-0235-9.
^ Machanavajjhala, Ashwin; Kifer, Daniel; Gehrke, Johannes; Venkitasubramaniam, Muthuramakrishnan. L-diversity: Privacy Beyond K-anonymity. ACM Transactions on Knowledge Discovery from Data. March 2007, 1 (1): 3–es. ISSN 1556-4681. S2CID 679934. doi:10.1145/1217299.1217302. Background Knowledge Attack. Alice has a pen-friend named Umeko who is admitted to the same hospital as Bob and whose patient records also appear in the table shown in Figure 2. Alice knows that Umeko is a 21-year-old Japanese female who currently lives in zip code 13068. Based on this information, Alice learns that Umeko’s information is contained in record number 1,2,3, or 4. Without additional information, Alice is not sure whether Umeko caught a virus or has heart disease. However, it is well known that Japanese have an extremely low incidence of heart disease. Therefore Alice concludes with near certainty that Umeko has a viral infection.
^ Aggarwal, Charu C. On k-Anonymity and the Curse of Dimensionality. VLDB '05 – Proceedings of the 31st International Conference on Very large Data Bases. Trondheim, Norway. 2005. ISBN 1-59593-154-6.
^ de Montjoye, Yves-Alexandre; César A. Hidalgo; Michel Verleysen; Vincent D. Blondel. Unique in the Crowd: The privacy bounds of human mobility. Scientific Reports. 2013-03-25, 3: 1376 [2021-08-11]. Bibcode:2013NatSR...3E1376D. PMC 3607247 . PMID 23524645. doi:10.1038/srep01376.
^ Angiuli, Olivia; Joe Blitzstein; Jim Waldo. How to De-Identify Your Data. ACM Queue. ACM. [2021-08-11]. （原始内容存档于2021-04-22）.
^ Angiuli, Olivia; Jim Waldo. Statistical Tradeoffs between Generalization and Suppression in the De-Identification of Large-Scale Data Sets. IEEE Computer Society Intl Conference on Computers, Software, and Applications. June 2016: 589–593. ISBN 978-1-4673-8845-0. S2CID 17716908. doi:10.1109/COMPSAC.2016.198.
^ Li, Lucy; Pal, Bijeeta; Ali, Junade; Sullivan, Nick; Chatterjee, Rahul; Ristenpart, Thomas. Protocols for Checking Compromised Credentials. 2019-09-04. arXiv:1905.13737  [cs.CR].
^ ^24.0 ^24.1 Ali, Junade; Dyo, Vladimir. Practical Hash-based Anonymity for MAC Addresses. 17th International Conference on Security and Cryptography (SECRYPT 2020). 2020: 572–579. ISBN 978-989-758-446-6. S2CID 218629946. arXiv:2005.06580 . doi:10.5220/0009825105720579.
^ Thomas, Kurt; Pullman, Jennifer; Yeo, Kevin; Raghunathan, Ananth; Kelley, Patrick Gage; Invernizzi, Luca; Benko, Borbala; Pietraszek, Tadek; Patel, Sarvar; Boneh, Dan; Bursztein, Elie. Protecting accounts from credential stuffing with password breach alerting. 2019: 1556–1571 [2020-05-22]. ISBN 9781939133069. （原始内容存档于2021-04-15）（英语）.
^ Demir, Levent; Kumar, Amrit; Cunche, Mathieu; Lauradoux, Cédric. The Pitfalls of Hashing for Privacy. Communications Surveys and Tutorials, IEEE Communications Society. 2018, 20 (1): 551 [2020-05-22]. S2CID 3571244. doi:10.1109/COMST.2017.2747598. （原始内容存档于2020-11-27）（英语）.

[1] Samarati, Pierangela; Sweeney, Latanya. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression (PDF). Harvard Data Privacy Lab. 1998 [2017-04-12]. （原始内容存档 (PDF)于2021-03-11）.

[2] P. Samarati. Protecting Respondents' Identities in Microdata Release （页面存档备份，存于互联网档案馆）. IEEE Transactions on Knowledge and Data Engineering archive Volume 13 Issue 6, November 2001.

[3] L. Sweeney. Database Security: k-anonymity. [2014-01-19]. （原始内容存档于2014-06-16）.

[4] L. Sweeney. k-anonymity: a model for protecting privacy （页面存档备份，存于互联网档案馆）. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 卌, 2002; 557-570.

[#1-5] 5.0 ^5.1 Find out if your password has been pwned—without sending it to a server. Ars Technica. [2018-05-24]. （原始内容存档于2021-06-13）（美国英语）.

[#2-6] 6.0 ^6.1 1Password bolts on a 'pwned password' check – TechCrunch. techcrunch.com. [2018-05-24]. （原始内容存档于2021-01-17）（美国英语）.

[7] 1Password Integrates With 'Pwned Passwords' to Check if Your Passwords Have Been Leaked Online. [2018-05-24]. （原始内容存档于2021-03-20）（英语）.

[8] Conger, Kate. 1Password Helps You Find Out if Your Password Is Pwned. Gizmodo. [2018-05-24]. （原始内容存档于2021-07-08）（美国英语）.

[9] Condon, Stephanie. Okta offers free multi-factor authentication with new product, One App. ZDNet. [2018-05-24]. （原始内容存档于2021-03-11）（英语）.

[10] Coren, Michael J. The world's biggest database of hacked passwords is now a Chrome extension that checks yours automatically. Quartz. [2018-05-24]. （原始内容存档于2021-02-17）（美国英语）.

[11] Wagenseil I, Paul. Google's New Chrome Extension Finds Your Hacked Passwords. www.laptopmag.com. [2021-08-11]. （原始内容存档于2021-06-27）.

[12] Google Launches Password Checkup Extension to Alert Users of Data Breaches. BleepingComputer. [2021-08-11]. （原始内容存档于2021-04-29）（美国英语）.

[13] Dsouza, Melisha. Google's new Chrome extension 'Password CheckUp' checks if your username or password has been exposed to a third party breach. Packt Hub. 2019-02-06 [2021-08-11]. （原始内容存档于2021-05-07）.

[14] Narayanan, Arvind; Shmatikov, Vitaly. Robust De-anonymization of Large Sparse Datasets (PDF). [2021-08-11]. （原始内容存档 (PDF)于2021-01-26）.

[Optimal-15] Roberto J. Bayardo; Rakesh Agrawal. Data Privacy through Optimal k-anonymization (PDF). 2005: 217–28 [2021-08-11]. ISBN 978-0-7695-2285-2. ISSN 1084-4627. S2CID 17044848. doi:10.1109/ICDE.2005.42. （原始内容存档 (PDF)于2020-11-23）. Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k - 1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimal k-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a nontrivial dataset under a general model of the problem. |journal=被忽略 (帮助)

[16] Adam Meyerson; Ryan Williams. On the Complexity of Optimal K-Anonymity (PDF). New York, NY: ACM. 2004: 223–8 [2021-08-11]. ISBN 978-1581138580. S2CID 6798963. doi:10.1145/1055558.1055591. （原始内容存档 (PDF)于2014-05-28）. The technique of k-anonymization has been proposed in the literature as an alternative way to release public information, while ensuring both data privacy and data integrity. We prove that two general versions of optimal k-anonymization of relations are NP-hard, including the suppression version which amounts to choosing a minimum number of entries to delete from the relation. We also present a polynomial time algorithm for optimal k-anonymity that achieves an approximation ratio independent of the size of the database, when k is constant. In particular, it is a O(k log k)-approximation where the constant in the big-O is no more than 4. However, the runtime of the algorithm is exponential in k. A slightly more clever algorithm removes this condition, but is a O(k logm)-approximation, where m is the degree of the relation. We believe this algorithm could potentially be quite fast in practice. |journal=被忽略 (帮助)

[17] Kenig, Batya; Tassa, Tamir. A practical approximation algorithm for optimal k-anonymity. Data Mining and Knowledge Discovery. 2012, 25: 134–168. S2CID 14158546. doi:10.1007/s10618-011-0235-9.

[18] Machanavajjhala, Ashwin; Kifer, Daniel; Gehrke, Johannes; Venkitasubramaniam, Muthuramakrishnan. L-diversity: Privacy Beyond K-anonymity. ACM Transactions on Knowledge Discovery from Data. March 2007, 1 (1): 3–es. ISSN 1556-4681. S2CID 679934. doi:10.1145/1217299.1217302. Background Knowledge Attack. Alice has a pen-friend named Umeko who is admitted to the same hospital as Bob and whose patient records also appear in the table shown in Figure 2. Alice knows that Umeko is a 21-year-old Japanese female who currently lives in zip code 13068. Based on this information, Alice learns that Umeko’s information is contained in record number 1,2,3, or 4. Without additional information, Alice is not sure whether Umeko caught a virus or has heart disease. However, it is well known that Japanese have an extremely low incidence of heart disease. Therefore Alice concludes with near certainty that Umeko has a viral infection.

[19] Aggarwal, Charu C. On k-Anonymity and the Curse of Dimensionality. VLDB '05 – Proceedings of the 31st International Conference on Very large Data Bases. Trondheim, Norway. 2005. ISBN 1-59593-154-6.

[20] Montjoye, Yves-Alexandre; César A. Hidalgo; Michel Verleysen; Vincent D. Blondel. Unique in the Crowd: The privacy bounds of human mobility. Scientific Reports. 2013-03-25, 3: 1376 [2021-08-11]. Bibcode:2013NatSR...3E1376D. PMC 3607247 . PMID 23524645. doi:10.1038/srep01376.

[21] Angiuli, Olivia; Joe Blitzstein; Jim Waldo. How to De-Identify Your Data. ACM Queue. ACM. [2021-08-11]. （原始内容存档于2021-04-22）.

[22] Angiuli, Olivia; Jim Waldo. Statistical Tradeoffs between Generalization and Suppression in the De-Identification of Large-Scale Data Sets. IEEE Computer Society Intl Conference on Computers, Software, and Applications. June 2016: 589–593. ISBN 978-1-4673-8845-0. S2CID 17716908. doi:10.1109/COMPSAC.2016.198.

[23] Li, Lucy; Pal, Bijeeta; Ali, Junade; Sullivan, Nick; Chatterjee, Rahul; Ristenpart, Thomas. Protocols for Checking Compromised Credentials. 2019-09-04. arXiv:1905.13737  [cs.CR].

[#3-24] 24.0 ^24.1 Ali, Junade; Dyo, Vladimir. Practical Hash-based Anonymity for MAC Addresses. 17th International Conference on Security and Cryptography (SECRYPT 2020). 2020: 572–579. ISBN 978-989-758-446-6. S2CID 218629946. arXiv:2005.06580 . doi:10.5220/0009825105720579.

[25] Thomas, Kurt; Pullman, Jennifer; Yeo, Kevin; Raghunathan, Ananth; Kelley, Patrick Gage; Invernizzi, Luca; Benko, Borbala; Pietraszek, Tadek; Patel, Sarvar; Boneh, Dan; Bursztein, Elie. Protecting accounts from credential stuffing with password breach alerting. 2019: 1556–1571 [2020-05-22]. ISBN 9781939133069. （原始内容存档于2021-04-15）（英语）.

[26] Demir, Levent; Kumar, Amrit; Cunche, Mathieu; Lauradoux, Cédric. The Pitfalls of Hashing for Privacy. Communications Surveys and Tutorials, IEEE Communications Society. 2018, 20 (1): 551 [2020-05-22]. S2CID 3571244. doi:10.1109/COMST.2017.2747598. （原始内容存档于2020-11-27）（英语）.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]