

Statistica Neerlandica (1992) Vol. 46, no. 1, pp. 49-67

Disclosure risks for microdata

R.J. Mokken, University of Amsterdam

Oudezijds Achterburgwal 237, NL-1012 DL Amsterdam

The Netherlands

P. Kooiman

J. Pannekoek

L.C.R.J. Willenborg, Netherlands Central Bureau of Statistics

P.O. Box 959, NL-2270 AZ Voorburg

The Netherlands

In this paper a model is developed for assessing disclosure risks of a microdata set. It is an extension of the one presented in BETHLEHEM et al. (1988). It is used to calculate (an upper bound of) the risk that an investigator is able to re-identify at least one individual in an anonymized data set, and hence discloses some sensitive information about him. This risk is shown to depend on, among other things, two variables which can be controlled by the statistical office which is disseminating such a data set: the 'coarseness' of the key variables and the size of the data set. The model yields guidelines as to the usage of these two instruments to control the disclosure risk.

Keywords & Phrases: disclosure risk, microdata, multiple investigators, multiple keys.

1. INTRODUCTION

In KELLER and BETHLEHEM (1987, 1992) and BETHLEHEM et al. (1988) decisions whether or not a microdata set can be made available to an outside user are based on estimates of the number of elements in the microdata set which are unique in the population in terms of their values on a given combination of identifying variables in the microdata set (the so-called re-identification key).

* The views expressed in this paper are those of the authors and do not necessarily reflect the policies of the Netherlands Central Bureau of Statistics. The authors are grateful to Wil de Jong for his programming assistance.


They suggested two related criteria: the absolute criterion of a (critical) upper bound to that number and a relative one consisting of an upper bound for the estimated proportion of population uniques in the microdata set.

In this article a new criterion called the "disclosure risk" is proposed which depends not only on the number of unique elements in the population but also on the sample fraction and on the number of "acquaintances" of the outside user getting access to the data file. An acquaintance is thereby defined as any person in the population whose score on the key considered is known by the outside user. In sections 2 and 3 we develop the basic model restricting ourselves to a single outside user (or investigator) and a single given key K. "Disclosure risk" is defined here as the (conditional) probability that on the given key at least one element in the sample (the microdata set) can be identified by the investigator. In section 4 the model is extended to a single key with multiple investigators and in section 5 to a single investigator with multiple keys. WILLENBORG (1990a) considers the situation of multiple keys and multiple investigators. Section 6 contains a discussion of some points brought up in this paper and in other papers in this issue.

2. THE DISCLOSURE RISK

2.1. Definitions

In this subsection the concepts that will be used in the remainder of this paper are defined. It is assumed that a population of size N is identified and a sample of size n has been drawn from this population. The sample fraction is denoted by f. Furthermore, a number of variables is measured error-free for each sample element. The values obtained by these measurements (scores) are collected in records, one for each sample element. Together, the records constitute a data set (D) that will be made available to an investigator (I). We assume that the microdata set is anonymized, i.e. formal identifiers (name, address, public identification numbers, etc.) are removed from the set. We now define the following concepts:

Key (K) A key is a subset of the variables in the data set which alone or in combination can generally serve to re-identify some elements by some investigators if they know the values on the key for those elements. It is assumed that the variables in the key are variables for which the measurements fall in a finite number of categories, derivable from the categories reported in D. The key can be considered as one variable with the possible number of values equal to the product of the possible numbers of values of the variables it consists of. The number of possible values of the key is denoted by k.
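As a small numerical illustration (a Python sketch, not part of the paper; the key variables and their category counts below are hypothetical), the number of possible key values k is simply the product of the per-variable category counts:

```python
from math import prod

# Hypothetical key variables with illustrative category counts
# (not taken from the paper).
categories = {"sex": 2, "age_group": 18, "marital_status": 4}

# The key acts as a single variable whose number of possible values k
# is the product of the per-variable category counts.
k = prod(categories.values())
print(k)  # 2 * 18 * 4 = 144
```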


Circle of acquaintances (A) With respect to investigator I and key K, a circle of acquaintances (A) is defined as the subset of the population elements for which investigator I knows the value of the key. Obviously, A and its size a will depend on the particular investigator I as well as on the key K: A = A(K, I). The number of elements of A is a and the fraction of acquaintances of I in the population is f_a = a/N. A circle of acquaintances can, in the worst case, consist of another available microdata set containing additional identifying information, which can be used by I for matching with the microdata set D to be released.

Uniqueness A population element E is unique on the key K if there are no other elements in the population with the same key value as E. The set of unique population elements is U = U(K), with size u, representing a fraction f_u of the total population.

Identification Identification of an element E by an investigator I on a key K occurs if all conditions corresponding to the following events are satisfied: C1. E is unique on K (probability: f_u). C2. E belongs to A (probability: f_a). C3. E is an element of the sample (probability: f). C4. I knows that E is unique on K. C5. I comes across E in data set D.

Condition C4 is a rather exacting one, but it can be introduced as an assumption for the sake of convenience in formulating our model. Note that it then yields a worst-case situation, in the sense that fallible perception and memory or other sources of ignorance, confusion and uncertainty for a potential discloser are excluded. Taken as an assumption together with C5 the implication is that the occurrence of any unique acquaintance E of I in data set D is equivalent to identification by I.

Disclosure Disclosure of information on E by I takes place if E is identified and E's record contains scores that were not known by I before the data set became available. It means that certain information concerning E, which is new to I, is revealed to I. We shall assume that such information 'to be disclosed' is always present, so that re-identification necessarily entails disclosure. The number of disclosures is the number of elements E for which disclosure has occurred. The type of disclosure defined here is called identity disclosure; it has to be contrasted with prediction disclosure; see SKINNER (1992).

Disclosure risk (R) The disclosure risk for a certain microdata set D with respect to a certain investigator I and a certain key K, is the probability that I makes at least one


disclosure of a record in D on the basis of K. Actual disclosure requires full completion of the set of events C1-C5 above. The probability of such a disclosure for given I and E can be denoted by

Pr(C1, ..., C5).   (1)

In this paper we shall derive some results concerning probabilities involving the first three occurrences in (1),

R = Pr(C1, C2, C3),   (2)

instead of

Pr(C1, ..., C5) = R × Pr(C4, C5 | C1, C2, C3).   (3)

The last factor of (3) denotes the (conditional) probability that, given the fulfillment of C1-C3, investigator I actually achieves full disclosure by satisfying conditions C4 and C5. This conditional probability, however, depends on rather subjective circumstances and is therefore much more difficult to model and estimate. If, as stated above, C4 and C5 are taken as assumptions, then this conditional probability is trivially equal to one, so that the risk R is equal to the probability (1). On the whole, however, it will be clear from (3) that the risk R can be seen as an upper bound for the actual disclosure risk for I.

Disclosure control

Disclosure control denotes all activities that are undertaken to ensure that, for a given data set, the disclosure risk is below a specified bound. This bound will be called a criterion (for a "safe" data set). In this article a criterion of the type "R ≤ γ" will be used, where γ is a threshold risk value to be chosen a priori.

2.2. A model for the disclosure risk

In order to apply a criterion based on the expected number of disclosures or on the disclosure risk, the values of these quantities for a given data set are needed. In this section expressions for these quantities are derived on the basis of a set of assumptions. Throughout this section we shall assume that identification of a record implies disclosure of confidential information. Thus identification can be treated as equivalent to disclosure.

In addition to C1-C5 we will assume that for any population element the three events 'belonging to D', 'belonging to A' and 'belonging to U' are independent and that the probabilities of the occurrence of these events are equal to f = n/N, f_a = a/N and f_u = u/N respectively, for each E, with n the number of sample elements and N the number of population elements. This assumption can be written as:

A1. Pr(E ∈ D ∩ A ∩ U) = Pr(E ∈ D) Pr(E ∈ A) Pr(E ∈ U) = f f_a f_u, for an arbitrary population element E. With these assumptions (C1-C5, A1),


the expected number of disclosures is equal to the expected number μ_uas of unique acquaintances of I in the sample D, which is given by

μ_uas = N f f_a f_u.   (4)

For a random sample without replacement, the distribution of the number of unique acquaintances in the sample (u_as) is hypergeometric with parameters n, N f_a f_u = u_a and N (JOHNSON and KOTZ, 1969, ch. 6). For small f, this distribution can be approximated by the binomial distribution with parameters n and u_a/N. According to this approximation, the probability that none of the N population elements will be disclosed is Pr(u_as = 0) ≈ (1 − f_a f_u)^n. The disclosure risk R is the complement of this probability:

R = Pr(u_as > 0) = 1 − Pr(u_as = 0) ≈ 1 − (1 − f_a f_u)^n.   (5)

When n is large and f_a f_u is small (as is usually the case) we have ln(1 − R) ≈ −n f_a f_u = −N f f_a f_u, or equivalently

R ≈ 1 − exp(−N f f_a f_u).   (6)

Note that the right hand side of (6) is equal to 1 − exp(−μ_uas). In WILLENBORG (1990b) alternative definitions of the disclosure risk are given that, under conditions that will usually be fulfilled in practical situations, also lead to approximation (6). Formulas (5) and (6) show that the risk can be influenced by the statistical office that disseminates the microdata, because it can "control" the size of the data set to be released (which we have taken here equal to the sample size n) and the fraction of uniques (which depends on the "coarse graining" of the information provided by the key variables).
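The single-key risk formulas (5) and (6) are easy to evaluate directly. A Python sketch (not part of the paper; the parameter values below are illustrative, not taken from the text):

```python
import math

def disclosure_risk(n, f_a, f_u, approximate=False):
    """Risk that at least one unique acquaintance occurs in the sample.

    n   : sample size (n = f * N)
    f_a : fraction of acquaintances of I in the population
    f_u : fraction of population uniques on the key
    """
    if approximate:
        # Formula (6): R ~ 1 - exp(-n * f_a * f_u)
        return 1.0 - math.exp(-n * f_a * f_u)
    # Formula (5): R ~ 1 - (1 - f_a * f_u)^n
    return 1.0 - (1.0 - f_a * f_u) ** n

# Illustrative values: 10 000 records, f_a = 0.005, f_u = 0.001
R_exact = disclosure_risk(10_000, 0.005, 0.001)
R_approx = disclosure_risk(10_000, 0.005, 0.001, approximate=True)
print(R_exact, R_approx)  # the two values are close for small f_a * f_u
```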

The risk formulas (5) or (6) can also be used to calculate the maximal sample fraction. That is to say, for particular values of f_a and f_u, a value f_max for f can be calculated such that R = γ. Since the risk is an increasing function of the sample fraction, data sets with a sample fraction smaller than or equal to f_max can be disseminated "safely" in the sense that the disclosure risk is smaller than γ. The maximal sample fraction can be used for a specific disclosure avoidance technique, namely subsampling. If the actual sample fraction is larger than this maximal sample fraction, it is possible to create a "safe" data set by subsampling from the original data set.

Using (5), f_max can be approximated by

f_max = ln(1 − γ) / (N ln(1 − f_a f_u)).   (7)

Alternatively, the approximation (6) can be used, resulting in

f_max = −ln(1 − γ) / (N f_a f_u),   (8)

which, for small γ, can in turn be approximated by

f_max ≈ γ / (N f_a f_u).   (9)
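Solving the risk formulas (5) and (6) for the sample fraction at R = γ gives the maximal sample fraction. A Python sketch (not part of the paper; parameter values illustrative):

```python
import math

def f_max_exact(gamma, N, f_a, f_u):
    # From (5): solve gamma = 1 - (1 - f_a*f_u)**(f*N) for f.
    return math.log(1.0 - gamma) / (N * math.log(1.0 - f_a * f_u))

def f_max_approx(gamma, N, f_a, f_u):
    # From (6): solve gamma = 1 - exp(-N*f*f_a*f_u) for f.
    return -math.log(1.0 - gamma) / (N * f_a * f_u)

def f_max_small_gamma(gamma, N, f_a, f_u):
    # For small gamma, -ln(1 - gamma) ~ gamma.
    return gamma / (N * f_a * f_u)

# Illustrative values (not from the paper): N = 100 000, f_a = 0.003, f_u = 0.001
args = (0.001, 100_000, 0.003, 0.001)
print(f_max_exact(*args), f_max_approx(*args), f_max_small_gamma(*args))
```

The three versions agree closely whenever f_a f_u and γ are small, which is the case of practical interest.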


To calculate R and μ_uas for a given data set, values for f_a and f_u must be available. As we shall see in the next section, f_u may be estimated from the sample. Clearly f_a is unknown, but one can "guestimate" values for f_a for various types of investigators. In the tradition of risk analysis and sensitivity analysis, a decision maker of a statistical office could investigate the risks across a range of likely sizes a of A according to the types of users of the data set D to be considered: moderate sizes of say 150-300, corresponding to the familiar type of dedicated academic investigators, and larger sizes (say 1000 or more) for specialized agencies which are likely to possess comparable databases with identified records to be matched to D when available. Again, one might just postulate a reasonable upper bound for f_a, or even use the worst-case value 1, which assumes that every population element is an acquaintance of investigator I. The worst-case value will correspond to the special but important case that the investigator has a register at his disposal with fully identified elements of which the elements of data set D form a subset, so that all records of D can be matched.

2.3. Estimating the disclosure risk

In order to estimate the disclosure risk, we must have an estimator for the expected number μ of population uniques. For that purpose we consider the population to be partitioned into k cells, corresponding to the k key values. The population frequency in cell i (i = 1, ..., k) is denoted by Y_i. So, a unique key value corresponds to Y_i = 1 and the problem is to estimate the number of i's for which Y_i = 1. In this paper we will use an estimator based on the Poisson-gamma model (BETHLEHEM et al., 1988; MOKKEN et al., 1989) for this purpose, but the risk approach considered in this paper can also be applied using alternative estimators for the number of population uniques. According to the Poisson-gamma model the Y_i have a negative binomial distribution with parameters α and Nβ and have expectation N/k. Since the expected value of a negative binomial distribution with parameters α and Nβ is Nβα, we have, for the Poisson-gamma model, α = 1/(kβ). According to the Poisson-gamma model the expected number of population uniques μ is given by

μ = k Pr(Y = 1) = k α Nβ (1 + Nβ)^{−(1+α)} = N (1 + Nβ)^{−(1+1/(kβ))}.   (10)

An estimator, μ̂, for the expected number of population uniques, μ, can be obtained by replacing β in (10) by a maximum likelihood or moment estimator based on the sample data. By using the estimator for μ, f_u can be estimated by f̂_u = μ̂/N and the risk R can be estimated by replacing f_u by f̂_u in (5) or (6).
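Under the Poisson-gamma model, the expected number of population uniques in (10) depends only on N, k and β. A Python sketch (not part of the paper; the values of N and β below are illustrative, only the cell count k = 1108 is taken from the example in section 3.2):

```python
def expected_uniques(N, k, beta):
    """Expected number of population uniques mu under the Poisson-gamma
    model, with alpha = 1/(k*beta), as in equation (10)."""
    alpha = 1.0 / (k * beta)
    return N * (1.0 + N * beta) ** -(1.0 + alpha)

# Illustrative values: population of 100 000, key with 1108 cells
N = 100_000
mu = expected_uniques(N, 1108, beta=0.01)
f_u = mu / N  # estimated fraction of population uniques
print(mu, f_u)
```

In practice β would be replaced by a maximum likelihood or moment estimate computed from the sample frequencies.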

3. THE DISCLOSURE RISK IN SUBPOPULATIONS

3.1. A model for the disclosure risk in subpopulations

In this subsection, as in BETHLEHEM et al. (1988) and KELLER and BETHLEHEM (1992), data sets are considered that contain as one of the key variables a


variable indicating subpopulations for which the subpopulation totals N_j (with Σ_j N_j = N) are known.

According to assumption A1, the probability that a randomly chosen population element is an acquaintance is equal for each population element implying, in the case of subpopulations, that the acquaintances are distributed proportionally over the subpopulations. However, the models discussed in this section enable the calculation of the risk for alternative allocations of the acquaintances. In particular, the risk can be calculated for the (worst-case) allocation for which the risk is maximal.

One possibility for modeling a population that can be subdivided into subpopulations of known size is by using separate negative binomial distributions for each subpopulation, i.e.

Y_j ~ neg-bin(α_j, N_j β_j),   (11)

where Y_j is an arbitrary cell frequency in subpopulation j and α_j = 1/(k β_j), with k the number of values of the combination of key variables excluding the subpopulation indicator. This model will, in general, give a better description of the population than a model that ignores the subpopulation structure (as defined by a subset of the key variables) and that consists of only one β parameter and one population size N and, consequently, only one α parameter. It also enables the estimation of the expected number of unique elements for each subpopulation, μ_j say. And the expected number of unique elements in the entire population can be estimated as μ̂ = Σ_j μ̂_j. A simpler model that also uses the known subpopulation sizes is obtained by constraining the β parameters to be equal for all j, i.e.

Y_j ~ neg-bin(α, N_j β),   (12)

with α = 1/(kβ).

In the case of subpopulations, we will replace assumption A1 by

A2. Pr(E_j ∈ D_j ∩ A_j ∩ U_j) = Pr(E_j ∈ D_j) Pr(E_j ∈ A_j) Pr(E_j ∈ U_j) = f_j f_aj f_uj,

where E_j is a population element from subpopulation j, A_j and U_j are the subsets of A and U corresponding to subpopulation j, and f_aj = a_j/N_j, f_j = n_j/N_j and f_uj = u_j/N_j. The probability that none of the N population elements will be disclosed is now approximately Π_j (1 − f_aj f_uj)^{n_j}, and an approximation to R analogous to (6) is given by

ln(1 − R) ≈ −Σ_j n_j f_aj f_uj = −Σ_j N_j f_j f_aj f_uj.   (13)

According to (13) the approximated risk is maximal if all acquaintances belong to the subpopulation with the largest value of f_j f_uj.

If model (11) is used, we can calculate the maximal risk as follows. First, the parameters β_j are estimated. With these estimates and the known values for N_j and k, the fractions f_uj can be estimated. Now, the maximal risk can be calculated as

ln(1 − R) = −a f f_u,max,   (14)


where (for simplicity) f = f_j = n_j/N_j and f_u,max is the maximum of the f_uj values.

If model (12) is used and f_j = f, we do not have to estimate each f_uj in order to find f_u,max, since it can be verified (from (10) and β > 0) that ∂f_uj/∂N_j is negative for all N_j and f_uj is a monotonically decreasing function of N_j. Consequently, f_u,max is the value of f_uj for the smallest subpopulation.

Because of this relation between the estimated maximal risk under model (12) and the subpopulation size, it is useful to define the minimal subpopulation size, N_min, as the subpopulation size for which the estimated maximal risk equals γ, given the values of the other parameters. Therefore, we have

ln(1 − γ) = −a f f_u(N_min).   (16)

Combining (16) and (10) results in

N_min = (C^{−1/(1+α)} − 1)/β,   (17)

where C = −ln(1 − γ)/(af). The value for N_min indicates how refined the regional classification can be: the regional classification must be such that the size of the smallest subpopulation is larger than N_min. Since the derivative ∂N_min/∂(af) is strictly positive, N_min is an increasing function of the term af, the expected number of acquaintances of I in the sample D. Negative or zero values for N_min indicate that every possible subpopulation size is large enough to satisfy the criterion for a "safe" data set.
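A Python sketch of the minimal subpopulation size computation, assuming the closed form N_min = (C^{−1/(1+α)} − 1)/β with C = −ln(1−γ)/(af), which follows from combining the Poisson-gamma expression (10) for the fraction of uniques with the risk criterion; all parameter values below are illustrative:

```python
import math

def minimal_subpop_size(gamma, a, f, k, beta):
    """Minimal subpopulation size N_min at which the maximal risk equals
    gamma, assuming N_min = (C**(-1/(1+alpha)) - 1)/beta with
    C = -ln(1-gamma)/(a*f) (cf. equations (10), (16), (17))."""
    alpha = 1.0 / (k * beta)
    C = -math.log(1.0 - gamma) / (a * f)
    return (C ** (-1.0 / (1.0 + alpha)) - 1.0) / beta

# Illustrative values: gamma = 0.001, a = 300 acquaintances, f = 0.01
N_min = minimal_subpop_size(0.001, 300, 0.01, k=1108, beta=0.01)
print(N_min)
```

A round-trip check is possible: substituting N_min back into (10) gives a fraction of uniques at which the maximal risk 1 − exp(−a f f_u) equals γ again.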

3.2. Illustration

TABLE 1. Risk values × 10^3 for values of N, a and f.

                        Subpopulation size N
                    31 812    63 624    127 248
       f_u × 10^3:    2.17      1.00       0.46
    f        a
  0.001     30        0.07      0.03       0.01
  0.001    300        0.65      0.30       0.14
  0.001   1000        2.17      1.00       0.46
  0.01      30        0.65      0.30       0.14
  0.01     300        6.49      3.00       1.38
  0.01    1000       21.48      9.95       4.59

To illustrate how the theory discussed in the previous sections can be applied, we shall employ the same data as KELLER and BETHLEHEM (1992). This data file contains the scores of a sample of 8399 individuals from the Dutch


population in 1980 on four variables: household composition, age, marital status, and sex. Together, these variables constitute a key with 1108 key values. In order to decide whether or not a data set can "safely" be made available, Keller and Bethlehem employ the criterion f_u ≤ 0.001 and conclude that the data file analysed in this example would be "safe" for subpopulation sizes of at least 63 624, irrespective of the sample fraction and the number of acquaintances.

In order to compare this result with results obtained by the risk criterion (with γ = 0.001) of the present paper, we have calculated, for this key, the maximal risk for several values of the number of acquaintances a, the sample fraction f and subpopulation sizes of 0.5 × 63 624, 63 624 and 2 × 63 624. The results are displayed in table 1, where it is apparent that data sets with a minimal subpopulation size of 63 624 are not "safe" if the sample fraction exceeds 0.01 and the number of acquaintances is 300 or more. The rightmost column shows that the same conclusion applies to a minimal subpopulation size of twice 63 624, although the risk is, of course, smaller. Table 1 also shows that a subpopulation size that is only half as large as the minimal subpopulation size determined by Keller and Bethlehem can still be "safe" according to the risk criterion for some combinations of the number of acquaintances and the sample fraction.
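Assuming the maximal risk is approximated by R ≈ 1 − exp(−a f f_u), i.e. all a acquaintances fall in the subpopulation with fraction of uniques f_u, individual entries of table 1 can be reproduced. A Python sketch (not part of the paper):

```python
import math

def max_risk(a, f, f_u):
    # Maximal risk when all a acquaintances fall in one subpopulation
    # with fraction of uniques f_u: ln(1 - R) = -a * f * f_u.
    return 1.0 - math.exp(-a * f * f_u)

# f_u values for the three subpopulation sizes in table 1 (x 10^-3)
f_u_by_size = {31_812: 2.17e-3, 63_624: 1.00e-3, 127_248: 0.46e-3}

# Entry for f = 0.01, a = 300, N_j = 63 624: table 1 reports 3.00 x 10^-3
R = max_risk(300, 0.01, f_u_by_size[63_624])
print(round(R * 1e3, 2))
```

The other entries follow by varying a, f and the f_u value in the same way.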

4. MULTIPLE INVESTIGATORS

Thus far the criteria and disclosure risks were considered for the bilateral situation of a single user or investigator I getting access to a copy of a specific data set D. This corresponds more or less to the situation of a controlled distribution of D to qualified single investigators for restricted and personal use only. When data sets are distributed for public use, however, access of users to those data sets can hardly be controlled, so that usage by multiple users associated with the public circulation of the data sets concerned should be envisaged. If a microdata set is delivered to several, m say, investigators, our interest is in the risk (R_m) that at least one of them establishes at least one disclosure.

Corresponding to the m investigators there are m circles of acquaintances, A_l (l = 1, ..., m) say, with numbers of elements a_l. The number of population elements that is a member of at least one circle of acquaintances, a_m say, is the size of the union of the m sets A_l, i.e. a_m = |A_1 ∪ A_2 ∪ ... ∪ A_m|. If the sets A_l can be considered as independent random samples from the population and the data set is also a random sample from the population, we have that, analogous to (4), the expected number of sample elements that is unique and in at least one of the A_l equals f f_am μ = N f f_am f_u, with f_am = a_m/N. Now, we can approximate the risk R_m (in analogy to (5)) by

R_m ≈ 1 − (1 − f_am f_u)^n.   (18)

In order to calculate R_m, we need a value for f_am. One approach is to begin by


specifying a value for each a_l, just as in the case of one investigator. Now, since the sets A_l are considered to be random samples from the population, a_m (and, consequently, f_am) is a random variable. The realized value of a_m in our hypothetical sample of m circles of acquaintances is of course unknown. We might, however, as an approximation, replace a_m by its expectation and approximate the risk by

R'_m = 1 − (1 − f_Eam f_u)^n,   (19)

with f_Eam = E(a_m)/N. MANTEL and PASTERNACK (1968) and GITTELSOHN (1969) have investigated the distribution of a_m in the context of a so-called "committee problem", cf. JOHNSON and KOTZ (1977, pp. 162-176). For the case where a_l = a for all l, the distribution of a_m is given by

Pr(a_m = x) = (N choose x) Σ_{j=0}^{x} (−1)^j (x choose j) [(x−j choose a) / (N choose a)]^m.   (20)

JOHNSON and KOTZ (1977) describe a generalization for unequal a_l. The expectation of a_m, in the case where a_l = a for all l, is

E(a_m) = N − N (1 − a/N)^m.   (21)

Note that for m = 1, E(a_m) is simply a, and for large values of m, E(a_m) approaches its maximal value, N. If a = N, then E(a_m) = N, irrespective of the number of investigators. If a/N is small, E(a_m) is approximately ma.

To illustrate the effect of multiple investigators on the disclosure risk, we continue the example of the previous section. From table 1 we find that, for N = 63 624, f_u = 0.001, f = 0.001 and a = 30, 300, 1000, the risk for one investigator is 0.00003, 0.0003 and 0.001, respectively. In figure 1, the approximation R'_m is plotted, for these values of N, f_u, f and a, as a function of the number of investigators, m.

FIGURE 1. R'_m as a function of m for N = 63 624, f_u = 0.001, f = 0.001 and three values for a.


Figure 1 shows that for larger values of a (the number of acquaintances per investigator), the risk R'_m increases more sharply with the number of investigators than for smaller values. The value 0.061672 of R'_m for a = 1000 and m = 1000 is equal to the maximum of R'_m corresponding to E(a_m) = N, and the value 0.061143 of R'_m for a = 300 and m = 1000 is almost equal to this maximum.
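The values quoted above can be checked numerically, using E(a_m) = N(1 − (1 − a/N)^m) together with R'_m = 1 − (1 − f_Eam f_u)^n. A Python sketch (not part of the paper; the parameter values are those of the example):

```python
def expected_union_size(N, a, m):
    # E(a_m) = N - N*(1 - a/N)**m : expected number of distinct
    # acquaintances over m independent circles of size a (equation (21)).
    return N * (1.0 - (1.0 - a / N) ** m)

def risk_multiple_investigators(N, f, f_u, a, m):
    # R'_m = 1 - (1 - f_Eam * f_u)**n with f_Eam = E(a_m)/N (equation (19)).
    f_Eam = expected_union_size(N, a, m) / N
    n = f * N
    return 1.0 - (1.0 - f_Eam * f_u) ** n

N, f, f_u = 63_624, 0.001, 0.001
print(risk_multiple_investigators(N, f, f_u, a=1000, m=1000))
print(risk_multiple_investigators(N, f, f_u, a=300, m=1000))
```

Both computed values agree with the 0.061672 and 0.061143 reported in the text.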

5. MULTIPLE KEYS

5.1. The disclosure risk for multiple keys

Typical statistical data sets contain many variables in substantial detail. In the preceding sections we have analysed the disclosure risk conditional on a given single key K. However, for practical disclosure risk evaluation we have to consider the joint disclosure risk associated with the set of all potential re-identification keys contained in the data set. We therefore turn to the evaluation of the disclosure risk associated with multiple keys. We shall first extend the theory of section 2, leaving practical matters and an example for the next subsections.

The basic idea is the following. When progressively aggregating categories of (a subset of) the variables present in D we generate a series of keys getting less and less detailed. Associated with each one of these keys is a circle of acquaintances and a set of population uniques. The circles of acquaintances are likely to become wider the less detailed the key gets, whereas the number of uniques simultaneously decreases. The score on the variable 'age', for instance, is largely visible, and therefore widely known, when measured in 10-year groups, but in most cases it is only known precisely, i.e. in 1-year groups, for a few close friends and relatives. At the same time a key involving 1-year age groups generates considerably more uniques than the same key with 10-year age groups. So, generally speaking, an inverse relationship exists between the number of population uniques and the size of the circle of acquaintances associated with a key.

To make things somewhat more precise we define the maximal key K_max as the key given by the complete crossing of all the variables in D, with categories as recorded in D. This is the key with maximum detail, i.e. the highest possible resolution. We then consider the complete set K_D of subordinate keys consisting of any key, K_max included, that can be derived from K_max by collapsing its categories. The minimal key ∅ is obtained by maximum collapsing and thereby consists of a single category only, so that it is without any information. To find an expression for the total risk TR involved in releasing D we have to consider the probability that at least one element E of D is re-identified on at least one of the keys in K_D. Analogous to the derivation of equation (5) for the risk R associated with a single key it is straightforward to derive

TR = 1 − P^n,   (22)

where P is the probability of non-disclosure of a single randomly chosen


element E of D jointly on all elements of K_D. We now order the elements of K_D according to some measure of the amount of detail involved, e.g., the total number of categories or the number of population uniques. Let the ordered elements of K_D be K_i, for i = 1, ..., m, such that K_1 = ∅ and m is the number of elements (neglecting the fact that some measures of detail may not result in a unique ordering). The joint non-disclosure probability P of a single element E of D then obtains as

P = (1 − p_1) Π_{i=2}^{m} (1 − p_i),   (23)

where p_1 is the probability of disclosure on the minimal key ∅, and, for i = 2, ..., m, p_i is the probability that a randomly chosen element E of D is disclosed on key i, conditionally on the fact that it is not disclosed on any lower order key K_1, ..., K_{i−1}. The first right hand side factor is trivially equal to 1 since the minimal key is informationless and can never contribute to the disclosure risk.

The conditional probabilities p_i are difficult to evaluate when we do not impose further structure on the problem. Let A_i = A(K_i, I), U_i = U(K_i), f_ai = |A_i|/N, and f_ui = |U_i|/N. Since circles of acquaintances get smaller and fractions of uniques get larger when keys get more detailed, we shall assume

A_i ⊆ A_{i−1} and U_i ⊇ U_{i−1}, for i = 2, ..., m.   (24)

It should be noticed that we need not introduce (24) as an assumption when key K_{i−1} can be obtained by collapsing categories of key K_i, since then (24) is already implied. However, in practice this is mostly not the case, especially in a multivariate setting, so that we indeed need (24) as an assumption to make further progress. The understanding is that in practice, choosing an appropriate measure of the amount of detail involved in a key, it will hold in a global sense, though it may be violated locally.

It follows from (24) that E is disclosed on K_i while it is not disclosed on K_1 through K_{i−1} if and only if E ∈ A_i ∩ (U_i \ U_{i−1}). Under the independence assumptions we made earlier this entails

p_i = f_ai (f_ui − f_u,i−1).   (25)

Substituting (25) and (23) into (22), and using p_1 = f_a1 f_u1, we obtain

TR = 1 − (1 − f_a1 f_u1)^n Π_{i=2}^{m} [1 − f_ai (f_ui − f_u,i−1)]^n.   (26)

In practical cases the disclosure probabilities are very small, so that (26) can be approximated by

TR ≈ 1 − exp(−n f_a1 f_u1 − n Σ_{i=2}^{m} f_ai (f_ui − f_u,i−1)).   (27)
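The multiple-key total risk can be sketched numerically, assuming the conditional disclosure probabilities take the form p_i = f_ai(f_ui − f_u,i−1) and the product form of (26); the key ordering and the fractions below are illustrative, not from the paper (Python sketch):

```python
def total_risk(n, f_a, f_u):
    """Total risk over an ordered sequence of keys (coarse to detailed).

    f_a : fractions of acquaintances per key, non-increasing
    f_u : fractions of population uniques per key, non-decreasing
    Uses p_i = f_a[i] * (f_u[i] - f_u[i-1]) with f_u before the first key
    taken as 0, and TR = 1 - prod_i (1 - p_i)**n (cf. (25)-(26)).
    """
    prob_no_disclosure = 1.0
    prev_fu = 0.0
    for fa_i, fu_i in zip(f_a, f_u):
        p_i = fa_i * (fu_i - prev_fu)
        prob_no_disclosure *= (1.0 - p_i) ** n
        prev_fu = fu_i
    return 1.0 - prob_no_disclosure

# Illustrative ordering: three keys of increasing detail
f_a = [0.010, 0.005, 0.001]   # circles of acquaintances shrink
f_u = [0.0001, 0.001, 0.005]  # fractions of uniques grow
print(total_risk(n=10_000, f_a=f_a, f_u=f_u))
```

For these numbers the total risk exceeds the single-key risk of the most detailed key alone, illustrating why the joint evaluation matters.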

This expression for the total risk can also be obtained by conditioning on higher order keys instead of on lower order keys. Assumption (24) entails that f_ai is monotonically non-increasing with i, whereas f_ui is monotonically non-decreasing. Figure 2 illustrates this property in case the score on the maximal key is unknown to the investigator for every element in the population, as is usually the case. The risk region indicated is defined by the range of keys with f_ai f_ui > 0. If it exists at all, its lower bound, i−, is defined by f_ui > 0 for all i > i−, whereas its upper bound, i+, is defined by f_ai > 0 for all i < i+. When evaluating the disclosure risk involved in the release of D it suffices to consider keys in this region only.

[Figure 2: fa_i and fu_i plotted against the key index i from 1 to m, on a scale from 0 to 1; the risk region lies between i- and i+.]

FIGURE 2. The risk region

An interesting special case obtains when fa_i is (approximately) constant on the risk region. It can easily be checked that in that case the total risk is equal to the single-key risk R_i as given in (5) or (8), for the key corresponding to i = i-, so that it suffices to consider a single key only. This situation occurs, for example, when the data set can be matched with an external data base.

5.2. Selection of keys

In this subsection we address the problem of which keys to choose when calculating disclosure risks. The number m of all possible keys is generally very large, so we have to find a shortcut once we turn to practical applications. We tentatively present a reduction procedure aiming at the selection of a limited number of keys to be used in calculating the total disclosure risk. Since our ideas on this issue are still preliminary, and our experience in their practical application very limited, the discussion will be somewhat more informal, qualitative and sketchy than in the preceding sections. As an illustration we apply the method in the next subsection to a set of data stemming from the Dutch Labour Force Survey.

In selecting keys to be considered we can restrict ourselves to identifying variables (cf. KELLER and BETHLEHEM, 1992) only. Next we consider the dimension of the keys to be analysed. When it concerns a research file which has to be protected against spontaneous recognition, it may suffice to consider two- or three-dimensional keys only, since simultaneous perception and processing of more than a few characteristics is difficult for the human brain, so that actual recognition becomes unlikely on higher-dimensional keys. When we protect against the risk of disclosure by matching, higher-dimensional keys might be considered, though, depending on the kind of identifiers and the dimensionality of potential matching keys likely to be available in candidate matching files. In both cases thorough subject-matter knowledge is required to decide on these issues.

We then have to locate the risk region. This essentially means analysing population uniqueness for different keys, as well as deciding upon the most likely sizes of the associated circles of acquaintances. Potential matching keys are good candidates for the upper bound of the risk region. In the case of research files, where matching is contractually prohibited, we may exploit the fact that detailed knowledge of the score of a person on one identifier often coincides with detailed knowledge of his scores on other identifiers, so that the univariate circles of acquaintances are highly correlated. The size of the circle of acquaintances on a multivariate key is always less than or equal to the minimum of the sizes of the circles of acquaintances associated with the separate variables entering the key. High correlation of the univariate circles of acquaintances entails that we may approximate the size of the multivariate circle of acquaintances by this minimum. In making this approximation we stay on the safe side to the extent that it represents a worst-case solution.
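The minimum rule in the paragraph above is trivial to state in code. The variable names and univariate circle sizes below are invented for illustration.

```python
# Worst-case approximation from the text: the circle of acquaintances on a
# multivariate key is at most as large as the smallest univariate circle, and
# under high correlation we approximate it by that minimum.
# Variable names and sizes are hypothetical.
N = 63000
univariate_circle_sizes = {"income": 50, "age": 80, "sex": 2000}
multivariate_circle_size = min(univariate_circle_sizes.values())
fa_multivariate = multivariate_circle_size / N   # fraction of acquaintances
print(multivariate_circle_size, fa_multivariate)
```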

Using this result we may proceed as follows to locate the risk region. For each identifying variable separately we consider a small number of 'sensible' aggregations, corresponding with prespecified standard sizes of the circle of acquaintances. When considering income × age, for example, one may ask at which levels of aggregation an investigator knows the income of, say, 10, 50, 100, or 200 other persons. We may do the same for age. This gives 4² = 16 possible 'income × age' keys. However, combinations with different numbers of acquaintances for income and for age are 'inadmissible' to the extent that there exists another key, with (approximately) the same size of the circle of acquaintances, but with a larger set of population uniques, so that the risk is higher. Therefore we can restrict attention to only the 4 bivariate keys consisting of aggregations of variables with equally sized univariate circles of acquaintances. At the cost of some overestimation of the total risk the dimensionality problem is completely solved this way, since the number of keys to be considered is independent of the number of identifying variables comprising the key.
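The reduction from 16 candidate 'income × age' keys to 4 admissible ones can be sketched as follows; the aggregation labels are hypothetical.

```python
# Each identifying variable gets one aggregation level per prespecified standard
# circle size a; only combinations using the same a for every variable are
# retained as admissible keys. Level labels are hypothetical.
levels = {
    "income": {10: "income/64", 50: "income/16", 100: "income/8", 200: "income/4"},
    "age":    {10: "age/82",    50: "age/16",    100: "age/8",    200: "age/4"},
}
standard_sizes = (10, 50, 100, 200)
admissible = [tuple(levels[v][a] for v in sorted(levels)) for a in standard_sizes]
print(len(admissible), "admissible keys out of",
      len(levels["income"]) * len(levels["age"]), "combinations")
```

The count of admissible keys equals the number of standard circle sizes, regardless of how many variables enter the key, which is the point made in the text.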

In most practical cases the number of keys obtained at this stage may still be too large to handle. We can, for instance, choose about 1000 sets of 3 variables from a total of, say, 20 identifiers. These are entirely non-hierarchical, so that assumption (24), underlying the total risk formulas (26) and (27), is not automatically satisfied. Considering 4 different aggregations, corresponding to 4 different standard sizes of the circle of acquaintances, as in the example above, we end up with about 4000 keys to be considered when the risk of spontaneous recognition has to be evaluated, i.e. about 1000 different sets of 4 levels of aggregation of 3 variables. From this set we finally have to select a limited set of keys for the actual calculation of the total risk according to equations (26) or (27). We have no general procedure to offer here. In principle we can calculate the total risk associated with each one of the hierarchical subsets of 4 keys, but it is very hard to combine the risks of 1000 of these subsets, since they are neither hierarchical nor independent.

A general solution which we have in mind could be as follows. For each one of the prespecified standard sizes of the circle of acquaintances to be considered we obtain a set of different 'non-hierarchical' keys. In the example of the preceding paragraph it concerns 4 sets of about 1000 keys each. Instead of estimating the fraction of uniques associated with each one of these keys one might try to estimate, for each one of the four sets separately, the joint fraction of uniques for the complete set of about 1000 keys. The numerator of this fraction is defined as the total number of population elements being unique on at least one of the keys in the set considered. This seems to be the appropriate fu value to be used in equations (26) or (27) together with the associated standard fa value. Generalising the Poisson-gamma model of equation (10) to include estimates of numbers of population uniques on sets of keys instead of on single keys still remains to be done, though.
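The joint fraction of uniques described above can be computed directly at the population level (in practice it would of course have to be estimated from a sample). The toy records below are invented.

```python
from collections import Counter

def joint_unique_fraction(population, keys):
    """Fraction of population elements unique on at least one of the given keys.
    population: list of dicts mapping variable name -> score;
    keys: list of tuples of variable names."""
    unique_indices = set()
    for key in keys:
        scores = [tuple(record[v] for v in key) for record in population]
        counts = Counter(scores)
        unique_indices.update(i for i, s in enumerate(scores) if counts[s] == 1)
    return len(unique_indices) / len(population)

# Toy population of four elements with two identifying variables.
population = [
    {"age": 30, "sex": "m"}, {"age": 30, "sex": "f"},
    {"age": 30, "sex": "f"}, {"age": 40, "sex": "m"},
]
print(joint_unique_fraction(population, [("age",), ("sex",)]))  # → 0.25
```

Only the last element is unique on one of the two univariate keys, so the joint fraction is 1/4; on the bivariate key ("age", "sex") two elements are unique, giving 1/2.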

When a hierarchical set of keys has been selected, the total risk can be computed and compared with the risk level the agency is prepared to accept. When the total risk is too high, the data set has to be modified before release. It is outside the scope of this paper to deal with the question of how an acceptable risk level could be determined, or how an optimal procedure for risk reduction could be devised.

5.3. An example: the Dutch Labour Force Survey

In this section we illustrate the theory developed in the previous sections using data from the Dutch Labour Force Survey. We consider only records of persons with a job living in the city of Groningen in the northern part of The Netherlands. We have N ≈ 63000 and n = 729. Six key variables are considered in our example: age, sex, marital status, family composition, social group and education. In our example we consider a single investigator who only uses his personal knowledge of the population, and in particular does not resort to a data file with personal data for matching. In order to avoid trivialities we assume that this investigator is acquainted with people living in the city of Groningen.

We assume that this investigator may use his knowledge of the population with respect to the key variables at three levels of aggregation. With respect to the most detailed level (level 1) we assume that he knows 10 people in the target population, at a less detailed level (level 2) 50 people, and at the least detailed level (level 3) 100 people. With each level we associate a certain amount of detail of the key variables. At the first level the categories of the key variables are the ones present in the original microdata file. At the second and third level, some of the categories have been collapsed in such a way that the resulting variables are still meaningful from a substantive point of view. The number of categories of each variable at each level is listed in table 2.

TABLE 2. Number of categories of key variables at three levels of detail

Variable              level 1     level 2     level 3
                      (a = 10)    (a = 50)    (a = 100)
Age                        82          16            8
Sex                         2           2            2
Marital status              4           2            2
Family composition         20           5            2
Social group               13           4            3
Education                  48           6            3

For each combination of key variables, three keys were considered, one for each of the three levels of detail, and for each key the fraction of population uniques was estimated using the negative binomial model as discussed in section 2. From the estimates obtained and the associated fractions of acquaintances fa_i, i = 1, 2, 3, we calculated for each combination of key variables the total risk TR for the three keys pertaining to that combination, using formula (27). The results are displayed in table 3, where the same column is used for all combinations of the same number of key variables (key dimension). The calculated TRs are ordered columnwise in descending order. Note that the maximal TRs for each key dimension increase with the key dimension, and that the TRs differ substantially within key-dimension groups. This latter phenomenon is due to the fact that some key combinations yield much higher fractions of population uniques than others.
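To make formula (27) concrete for this example: with N = 63000 and a = 100, 50, 10 acquaintances for levels 3, 2 and 1, the fractions fa_i follow directly, and only the fractions of uniques are needed. The fu values below are invented stand-ins, not the negative-binomial estimates actually used for table 3.

```python
# Keys ordered from least detailed (level 3, a = 100) to most detailed
# (level 1, a = 10), as formula (27) requires. fu values are hypothetical.
N, n = 63000, 729
fa = [100 / N, 50 / N, 10 / N]          # fractions of acquaintances
fu = [1.0e-6, 2.0e-5, 4.0e-4]           # hypothetical fractions of uniques
tr = n * (fa[0] * fu[0] + sum(fa[i] * (fu[i] - fu[i - 1]) for i in (1, 2)))
print(round(tr * 1000, 3))              # "Total risk x 1000", as in table 3
```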

As was mentioned in the previous section we are not (yet) able to combine these results to obtain a total risk for each key dimension (except in the trivial case of full key dimension), let alone an overall total risk.


TABLE 3. Total risk × 1000 (over the three levels of detail) for all combinations of key variables

Number of key vars.:   1       2       3       4        5        6
                     0.007   0.544   2.925   6.524   12.272   13.139
                     0.002   0.198   2.357   5.375    7.438
                     0.001   0.141   1.175   4.130    7.279
                     0.001   0.102   1.082   4.118    4.746
                     0.001   0.063   1.082   2.833    2.742
                     0.000   0.041   0.596   2.210    1.446
                             0.025   0.408   2.029
                             0.017   0.395   1.431
                             0.015   0.383   0.986
                             0.006   0.329   0.909
                             0.005   0.241   0.879
                             0.004   0.202   0.713
                             0.001   0.194   0.389
                             0.001   0.128   0.363
                             0.001   0.090   0.152
                                     0.074
                                     0.055
                                     0.038
                                     0.014
                                     0.009

6. DISCUSSION

In this article the disclosure risk is defined as the probability that, for one (several) given key(s), at least one element in the microdata set can be identified by one (several) investigator(s). Alternative measures of the disclosure risk are mentioned in this issue by Keller and Bethlehem, who use the resolution of the population probability distribution of the key values, and Greenberg and Zayatz, who use the entropy of the population frequency distribution of the key values.

Although the resolution and the entropy are relevant to some aspects of the disclosure problem, we think that these measures are more difficult to interpret than our risk measure. The main problem is that there is no simple relation between the entropy or the resolution and the number of population uniques. For instance, a population consisting of uniques only will have the same (maximal) resolution as a population where each key value has frequency 1000, whereas the latter population is clearly less risky than the former. For a population with average frequency 1000 a peaked distribution with a number of ones in the tails will be more risky but have a smaller resolution than a uniform distribution. In general one may expect that high resolution keys are unfavourable (i.e. yield many population uniques) when the average key frequency is small and favourable when the average key frequency is high.

A problem that is common to the resolution, the entropy and the risk measure proposed in this paper is that these measures are functions of the population key frequencies or probabilities, and these quantities are generally unknown in practice. In situations in which there is a disclosure problem, a number of the population frequencies will be small and an even larger number of the sample frequencies will be small, including many zero frequencies. In such situations the sample frequencies provide insufficient information to estimate the small population frequencies. The impossibility of estimating the complete population distribution from the sample distribution has, in fact, motivated the development of the Poisson-gamma model as an approximation to the population distribution.

Greenberg and Zayatz and Skinner argue that the approximation provided by the Poisson-gamma model can sometimes be poor. It seems worthwhile therefore to investigate other models or extensions of the Poisson-gamma model to handle such situations. A possible extension of the Poisson-gamma model can be obtained by allowing for different parameters α and β for different subsets of the key values. More generally, the values of α and β may be described by linear or generalized linear models with the key variables as explanatory variables (see PANNEKOEK, 1991, for a similar approach in a different context). When searching for other models and estimation methods, the literature on abundance models as used in ecology (e.g. ENGEN, 1978) may appear to be a very useful source, because the problems discussed there are to some extent similar to the problem of estimating the number of uniques. Also see PAASS and WAUSCHKUHN (1985), who developed a complicated method using a variety of techniques to evaluate uniqueness. For a concise description of this method see section 5 of BLIEN et al. (1992).
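A minimal sketch of the Poisson-gamma idea behind these extensions: with cell intensities drawn from a Gamma(α, β) distribution, population cell frequencies are negative binomially distributed, and the expected number of uniques follows from P(F = 1). The parameterisation (β as a scale parameter, so p = 1/(1 + β)) and the numbers are assumptions for illustration; varying α and β over subsets of key values, as suggested above, would amount to evaluating this with subset-specific parameters.

```python
# Poisson-gamma model: F ~ Poisson(lambda), lambda ~ Gamma(alpha, beta) gives
# F ~ NegBin with P(F = f) = C(alpha + f - 1, f) p^alpha (1 - p)^f,
# where p = 1/(1 + beta). Parameter values are hypothetical.
from math import lgamma, log, exp

def negbin_pmf(f, alpha, beta):
    """Probability that a key cell has population frequency f."""
    p = 1.0 / (1.0 + beta)
    return exp(lgamma(alpha + f) - lgamma(alpha) - lgamma(f + 1)
               + alpha * log(p) + f * log(1.0 - p))

K = 10_000                     # number of key cells (hypothetical)
alpha, beta = 0.5, 20.0
expected_uniques = K * negbin_pmf(1, alpha, beta)
print(expected_uniques)
```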

REFERENCES

BETHLEHEM, J.G., W.J. KELLER and J. PANNEKOEK (1990), Disclosure control of microdata, Journal of the American Statistical Association 85, 38-45.

BLIEN, U., H. WIRTH and M. MULLER (1992), Disclosure risk for microdata stemming from official statistics, Statistica Neerlandica 46, 69-82.

ENGEN, S. (1978), Stochastic abundance models, Chapman and Hall, London.

GITTELSOHN, A.M. (1969), An occupancy problem, American Statistician 23, 11-12.

GREENBERG, B.V. and V. ZAYATZ (1992), Strategies for measuring risk in public use microdata files, Statistica Neerlandica 46.

JOHNSON, N.L. and S. KOTZ (1969), Discrete distributions, Wiley, New York.

JOHNSON, N.L. and S. KOTZ (1977), Urn models and their applications, Wiley, New York.

KELLER, W.J. and J.G. BETHLEHEM (1987), Disclosure protection of micro data, in: CBS Select 4, Statistical essays, Staatsuitgeverij, The Hague, 87-96.

KELLER, W.J. and J.G. BETHLEHEM (1992), Disclosure protection of micro data: problems and solutions, Statistica Neerlandica 46, 5-19.

MANTEL, N. and B.S. PASTERNACK (1968), A class of occupancy problems, American Statistician 22, 23-24.

MOKKEN, R.J., J. PANNEKOEK and L.C.R.J. WILLENBORG (1989), Microdata and disclosure risks, in: CBS Select 5, Statistical essays, Staatsuitgeverij, The Hague, 181-200. In slightly revised form also in: Proceedings of the Sixth Annual Research Conference, Bureau of the Census, Washington, D.C., 167-180.

PAASS, G. and U. WAUSCHKUHN (1985), Datenzugang, Datenschutz und Anonymisierung - Analysepotential und Identifizierbarkeit von anonymisierten Individualdaten, Oldenbourg-Verlag, Munich.

PANNEKOEK, J. (1991), A mixed model for analyzing measurement errors for dichotomous variables, in: P.P. BIEMER (ed.), Measurement errors in surveys, Wiley, New York.

SKINNER, C.J. (1992), On identification disclosure and prediction disclosure for microdata, Statistica Neerlandica 46, 21-32.

WILLENBORG, L.C.R.J. (1990a), Disclosure risks for microdata sets: stratified populations and multiple investigators, internal report, CBS, Voorburg.

WILLENBORG, L.C.R.J. (1990b), Remarks on disclosure control of microdata, internal report, CBS, Voorburg.

