The Stata Blog » Merging data, part 1: Merges gone bad
When it comes to combining datasets, the alternative to merging is appending, which is combining datasets on the same variables to produce a result with more observations. Appending datasets is not the subject for today. But just to fix ideas, appending looks like this:

              +-------------------+
              | var1  var2  var3  |      one.dta
              +-------------------+
           1. | one.dta           |
           2. |                   |
            . |                   |
            . |                   |
              +-------------------+

                        +

              +-------------------+
              | var1  var2  var3  |      two.dta
              +-------------------+
           1. | two.dta           |
           2. |                   |
            . |                   |
              +-------------------+

                        =

              +-------------------+
              | var1  var2  var3  |
              +-------------------+
           1. |                   |      one.dta
           2. |                   |
            . |                   |
            . |                   |
              +                   +
        N1+1. |                   |      two.dta   appended
        N1+2. |                   |
            . |                   |
              +-------------------+
Merging looks like this:

      +-------------------+           +-----------+
      | var1  var2  var3  |           | var4 var5 |
      +-------------------+           +-----------+
   1. |                   |        1. |           |
   2. |                   |    +   2. |           |    =
    . |                   |         . |           |
    . |                   |         . |           |
      +-------------------+           +-----------+
        one.dta                         two.dta

      +-------------------+-----------+
      | var1  var2  var3    var4 var5 |
      +-------------------------------+
   1. |                               |
   2. |                               |
    . |                               |
    . |                               |
      +-------------------+-----------+
        one.dta           + two.dta     merged
The matching of the two datasets, that is, deciding which observations in one.dta are combined with which observations in two.dta, could be done simply on the observation numbers: match one.dta observation 1 with two.dta observation 1, match one.dta observation 2 with two.dta observation 2, and so on. In Stata, you could obtain that result by typing

    . use one, clear
    . merge 1:1 _n using two
Never do this because it is too dangerous. You are merely assuming that observation 1 matches with observation 1, observation 2 matches with observation 2, and so on. What if you are wrong? If observation 2 in one.dta is Bob and observation 2 in two.dta is Mary, you will mistakenly combine the observations for Bob and Mary and, perhaps, never notice the mistake.
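To make the danger concrete, here is a minimal sketch using two tiny made-up datasets in which the same two people appear in a different order. The variable names (name, name2, x, y) and the data are hypothetical, and the sketch uses the sequential form `merge 1:1 _n` documented in [D] merge. Run it as a single do-file, because the tempfile names are local macros.

```stata
* Hypothetical data: the same two people, stored in a different order.
tempfile one two

clear
input str10 name x
"Bob"  1
"Mary" 2
end
save "`one'"

clear
input str10 name2 y
"Mary" 20
"Bob"  10
end
save "`two'"

* Sequential merge: observation 1 with observation 1, and so on.
use "`one'", clear
merge 1:1 _n using "`two'"
list name name2 x y _merge

assert _merge == 3        // Stata reports every observation as matched ...
assert name != name2      // ... yet each row combines two different people
```

Nothing in the merge output signals a problem; the mismatch is visible here only because both files happen to carry a name.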
The better solution is to match the observations on equal values of an identification variable. This way, the observation with id="Mary" is matched with the observation with id="Mary", id="Bob" with id="Bob", id="United States" with id="United States", and id=4934934193 with id=4934934193. In Stata, you do this by typing

    . use one, clear
    . merge 1:1 id using two
Things can still go wrong. For instance, id="Bob" will not match id="Bob " (with the trailing blank), but if you expected all the observations to match, you will ultimately notice the mistake. Mistakenly unmatched observations tend to get noticed because of all the missing values they cause in subsequent calculations.
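The trailing-blank case can be reproduced and caught early rather than through missing values later. The following is a sketch under stated assumptions: the data are made up, the key is a string variable named id, and trimming is shown as one possible repair, not the only one. It uses `strtrim()` (named `trim()` in older Stata releases) and should be run as a single do-file.

```stata
* Hypothetical data; tempfile names are local macros.
tempfile one two

clear
input str10 id x
"Bob"  1
"Mary" 2
end
save "`one'"

clear
input str10 id y
"Bob"  10
"Mary" 20
end
replace id = "Bob " in 1          // trailing blank added on purpose
save "`two'"

use "`one'", clear
merge 1:1 id using "`two'"
tabulate _merge                   // 1 = master only, 2 = using only, 3 = matched
count if _merge != 3
assert r(N) == 2                  // "Bob" and "Bob " each went unmatched

* One possible repair: trim the key in the offending file, then merge again.
use "`two'", clear
replace id = strtrim(id)
save "`two'", replace

use "`one'", clear
merge 1:1 id using "`two'"
assert _merge == 3                // stops with an error if anything is unmatched
```

Putting `assert _merge == 3` right after a merge that is expected to match completely turns a quiet problem into an immediate error.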
Observations are mistakenly combined more often than many researchers realize. I’ve seen it happen. I’ve seen it happen, be discovered later, and necessitate withdrawn results. You seriously need to consider the possibility that this could happen to you. Only three things are certain in this world: death, taxes, and merges gone bad.
I am going to assume that you are familiar with merging datasets both conceptually and practically; that you already know what 1:1, m:1, 1:m, and m:n mean; and that you know the role played by "key" variables such as ID. I am going to assume you are familiar with Stata's merge command. If any of this is untrue, read [D] merge. Type help merge in Stata and click on [D] merge at the top to take you to the full PDF manuals. We are going to pick up where the discussion in [D] merge leaves off.
As I said, the topic for today is merges gone bad, by which I mean producing a merged result with the wrong records combined. It is difficult to imagine that typing

    . use one, clear
    . merge 1:1 id using two

could produce such a result.
Right you are. There is no problem assuming the values in the id variable are correct and consistent between datasets. But what if id==4713 means Bob in one dataset and Mary in the other? That can happen if the id variable is simply wrong from the outset or if the id variable became corrupted in prior processing.
One way the id variable can become corrupted is if it is not stored properly or if it is read improperly. This can happen to both string and numeric variables, but right now, we are going to emphasize the numeric case.
Say the identification variable is Social Security number, an example of which is 888-88-8888. Social Security numbers are invariably stored in computers as 888888888, which is to say that they are run together and look a lot like the number 888,888,888. Sometimes they are even stored numerically. Say you have a raw data file containing perfectly valid Social Security numbers recorded in just this manner. Say you read the number as a float. Then 888888888 becomes 888888896, and so does every Social Security number between 888888865 and 888888927, some 63 in total. If Bob has Social Security number 888888869 and Mar
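The rounding described above can be checked directly in Stata. This is a small sketch, not part of the original discussion: `float()` rounds a value to float precision, and the second number in the dataset (888888921) is a hypothetical value picked from the affected range purely for illustration.

```stata
* Rounding nine-digit numbers to float precision.
display %12.0f float(888888888)                 // 888888896
display float(888888865) == float(888888927)    // 1: both ends of the range collide

* The same collision inside a dataset.
* 888888921 is a hypothetical second number from the affected range.
clear
set obs 2
generate long ssn_long = 888888869 in 1
replace ssn_long = 888888921 in 2
generate float ssn_float = ssn_long
format ssn_long ssn_float %12.0f
list

assert ssn_long[1]  != ssn_long[2]     // distinct when stored as long
assert ssn_float[1] == ssn_float[2]    // identical once stored as float
```

Nine-digit integers fit exactly in the long and double storage types, so reading such identifiers as long, as double, or as strings avoids the collision shown here.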