3.9 How to match the Data Files - TwinLife Documentation

For longitudinal studies, the data sets of different survey data collections need to be combined.
The individual data sets can easily be appended because variable names and categories have already been harmonized across all data collections.
For the person long format, different matching strategies can be chosen, depending on the desired structure of the combined data set: ‘long’ (several rows per person, one for each data collection, and one column per variable) or ‘wide’ (one row per person and, for each variable, one column per data collection).
In the following we provide syntax for Stata and SPSS for both cases. Especially for the family wide format, it is strongly recommended to only use and merge the variables that are needed for the analyses in order to limit the size of the final data set.

Matching data files in Stata
1. Person long format, ‘long’ (one row per data collection for each person and one column per variable). The following example for combining the two face-to-face data collections F2F1 and F2F2 can be customized:

cd "path" // navigate into the folder where the data is stored
use ZA6701_person_wid1_v4-0-0.dta // fill in the name of the data set in the version you are using
append using ZA6701_person_wid3_v4-0-0
append using … // optionally append further files of all data collections you want to use for longitudinal analysis

You can also limit the data to the variables you want to analyze by using the command

use varlist using ZA6701_person_wid1_v4-0-0.dta // replace ‘varlist’ with the list of variables you want to use for your analyses

2. Person long format, ‘wide’ (one row per person over all data collections and one column for each data collection and variable). Append the data of the data collections you want to analyze using the procedure described in 1):

cd "path"
use ZA6701_person_wid1_v4-0-0.dta
append using ZA6701_person_wid3_v4-0-0

Use the -reshape- command in order to get the person-wide format:

local varselect "varlist" // select list of variables that need to be converted from long to wide form
rename (`varselect') =_ // add suffix
reshape wide *_, i(pid) j(wid) // convert selected variables from long to wide form
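To illustrate what the rename and reshape steps produce, the following plain-Python sketch performs the same long-to-wide conversion on a toy data set. It is not part of the TwinLife syntax: the variable name age and all values are hypothetical, and only the identifiers pid and wid are taken from the text above.

```python
def long_to_wide(rows, id_var="pid", wave_var="wid"):
    """Turn one row per person and data collection into one row per person."""
    wide = {}
    columns = []
    for row in rows:
        person = wide.setdefault(row[id_var], {id_var: row[id_var]})
        for name, value in row.items():
            if name in (id_var, wave_var):
                continue
            # Naming pattern <variable>_<wid>, as produced by the suffix "_"
            # followed by the value of wid
            col = f"{name}_{row[wave_var]}"
            if col not in columns:
                columns.append(col)
            person[col] = value
    # Persons who did not take part in a data collection get missing values
    for person in wide.values():
        for col in columns:
            person.setdefault(col, None)
    return list(wide.values())


# Hypothetical toy data: 'age' is an illustrative variable name
long_rows = [
    {"pid": 101, "wid": 1, "age": 11},
    {"pid": 101, "wid": 3, "age": 13},
    {"pid": 102, "wid": 1, "age": 17},
]
wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Each person ends up in a single row, with one column per variable and data collection (here age_1 and age_3).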

3. Family wide format, ‘wide’ (one row per family over all data collections and separate columns for variables per person and data collection).
For analyses using the family wide format with Stata, use the -merge- command with the family identifier fid.

cd "path" // navigate into the folder where the data is stored
use varlist using ZA6701_family_wide_wid1_v4-0-0.dta // replace ‘varlist’ with the list of variables you want to use for your analyses; the list must include fid
merge 1:1 fid using ZA6701_family_wide_wid3_v4-0-0.dta, keepusing(varlist) // replace ‘varlist’ with the list of variables you want to use for your analyses

Matching data files in SPSS
1. Person long format, ‘long’ (one row per data collection for each person and one column per variable). The following example for combining the two face-to-face data collections F2F1 and F2F2 can be customized:

add files
/file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid1_v4-0-0.sav'
/file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid3_v4-0-0.sav'.
save outfile = 'C:\...\SUF_4-0-0_beta_04052020\en_person_wid13_match.sav'.
exe.

2. Person long format and Family wide format, ‘wide’ (one row per person/family and one column for each data collection and variable; one row per family over all data collections and separate columns for variables per person and data collection).
If the combined data needs to be in wide format, all variables in every data set (except for the matching variables) must have a data collection-specific suffix. In the person format, this suffix has to be created for all variables except pid before matching. In the family format, wave-specific suffixes are already provided (except for the variables wav0100, cgr and zyg0102, which are time-stable and therefore identical in all waves). Variable suffixes can easily be created using the Python plugin. The following code can be customized to do this:

begin program.
variables = 'all' # define the variables which should get a suffix (e.g. 'all'; 'x, y, z'; 'x to y').
suffix ='_1' # enter the chosen suffix.
import spss, spssaux
oldnames = spssaux.VariableDict().expand(variables)
newnames = [varnam + suffix for varnam in oldnames]
spss.Submit('rename variables(%s=%s).'%('\n'.join(oldnames),'\n'.join(newnames)))
end program.
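The renaming logic can also be tried outside SPSS. The following sketch only builds the text of the rename command for a given list of names and leaves the matching variable untouched, as required for the person format. The function build_rename_command, its keep parameter and the variable names age and sex are hypothetical and serve illustration only.

```python
def build_rename_command(names, suffix, keep=("pid",)):
    """Return an SPSS RENAME VARIABLES command that adds a suffix.

    Names listed in 'keep' (the matching variables) are not renamed.
    """
    old = [name for name in names if name not in keep]
    new = [name + suffix for name in old]
    return "rename variables (%s=%s)." % (" ".join(old), " ".join(new))


# Hypothetical variable names; pid is the matching variable and stays as it is
cmd = build_rename_command(["pid", "age", "sex"], "_1")
print(cmd)
```

Inside SPSS, a string of this form could be passed to spss.Submit in the same way as in the program above.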

When each dataset has data collection-specific suffixes, all datasets must be sorted by the matching variable; datasets in person format by the pid, datasets in family format by the fid (see chapter 3.5). To finally combine two data sets, the following code can be customized:

sort cases by pid.
match files
/file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid1_v4-0-0.sav'
/file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid2_v4-0-0.sav'
/by pid.
save outfile= 'C:\...\SUF_4-0-0_beta_04052020\en_person_wid12_match.sav'.
exe.
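As a rough illustration of what matching by a key variable does, the sketch below combines two small, already suffixed data sets into one row per pid. It is a simplified model in plain Python, not the SPSS implementation: the function match_by_key, the variable names age_1 and age_2 and all values are hypothetical. It assumes that persons who appear in only one data set are kept and receive missing values for the variables of the other data set.

```python
def match_by_key(first, second, key="pid"):
    """Combine two suffixed data sets into one row per key value."""
    merged = {}
    columns = []
    for dataset in (first, second):
        for row in dataset:
            target = merged.setdefault(row[key], {key: row[key]})
            for name, value in row.items():
                if name != key:
                    target[name] = value
                    if name not in columns:
                        columns.append(name)
    # Fill variables from a data collection a person did not take part in
    for target in merged.values():
        for name in columns:
            target.setdefault(name, None)
    # Return the rows ordered by the matching variable
    return [merged[k] for k in sorted(merged)]


# Hypothetical toy data with data collection-specific suffixes
wave1 = [{"pid": 101, "age_1": 11}, {"pid": 102, "age_1": 17}]
wave2 = [{"pid": 101, "age_2": 12}, {"pid": 103, "age_2": 23}]
matched = match_by_key(wave1, wave2)
for row in matched:
    print(row)
```

The example also shows why the suffixes are needed: without them, both data sets would contain a variable with the same name, and the values of the two data collections could not be kept in separate columns.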