TwinLife Documentation
  • Overview and getting started
    • Table of content
    • Getting started
    • Data access & documentation
  • 1. About TwinLife
    • 1.1 Basic Concept
    • 1.2 Study Design and Sample Structure
    • 1.3 Where and how to get the Data
  • 2. Documentation of the study
    • 2.1 Data Documentation Website and ShortGuide
    • 2.2 Documentation within the Data Sets
    • 2.3 paneldata.org
    • 2.4 Codebooks
    • 2.5 Technical Report Series, Methodology Reports, and Working Paper Series
  • 3. Data Structure
    • 3.1 Data Formats and Data Files
    • 3.2 Person Types
    • 3.3 System of Variable Names
    • 3.4 ID Variables, Wave and Data Collection Identifiers
    • 3.5 Missing Types and their Meanings
    • 3.6 Delivered Para Data
    • 3.7 Weights
    • 3.8 Peculiarities of Data
    • 3.9 How to match the Data Files
    • 3.10 Matching information from the parent-about-child questionnaire to the child's data set
  • 4. Check Routines
    • 4.1 Check routines
    • 4.2 Data Adjustment
  • 5. Generated Variables and Scales
    • 5.1 Generated Variables
    • 5.2 Generated Scales
  • 6. Publications and Citation
    • 6.1 Publications and Literature Database
    • 6.2 Citation
  • 7. Useful Links
  • Terms and Privacy
  • Downloads

3.9 How to match the Data Files

  • For longitudinal analyses, the data sets of the different data collections need to be combined.
    The individual data sets can simply be appended, because variable names and categories have already been harmonized across all data collections.
    For the person long format, different matching strategies are available, depending on the desired structure of the combined data set: ‘long’ (several rows per person, one for each data collection, and one column per variable) or ‘wide’ (one row per person and one column per variable and data collection).
    Below we provide syntax for Stata and SPSS for both cases. Especially for the family wide format, we strongly recommend using and merging only the variables needed for the analyses, in order to limit the size of the final data set.
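The difference between the two target structures can be illustrated outside of Stata and SPSS. The following is a minimal Python sketch on invented toy records, not part of the TwinLife syntax: the variable name `height` and all values are hypothetical, and only the identifiers `pid` and `wid` are taken from the text.

```python
# Toy records standing in for two data collections. The variable name
# "height" and all values are invented for illustration only.
wave1 = [{"pid": 1, "wid": 1, "height": 150},
         {"pid": 2, "wid": 1, "height": 160}]
wave3 = [{"pid": 1, "wid": 3, "height": 155},
         {"pid": 2, "wid": 3, "height": 163}]

# 'long': append the files, giving one row per person and data collection.
long_rows = wave1 + wave3

# 'wide': one row per person, one column per variable and data collection.
wide_rows = {}
for row in long_rows:
    person = wide_rows.setdefault(row["pid"], {"pid": row["pid"]})
    for name, value in row.items():
        if name not in ("pid", "wid"):
            # The data collection identifier becomes part of the column name.
            person[f"{name}_{row['wid']}"] = value

print(len(long_rows))  # 4
print(wide_rows[1])    # {'pid': 1, 'height_1': 150, 'height_3': 155}
```

In the long structure the data collection is identified by a column (`wid`), whereas in the wide structure it moves into the variable names.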
    Matching data files in Stata
    1. Person long format, ‘long’ (one row per data collection for each person and one column per variable). The following example for combining the two face-to-face data collections F2F1 and F2F2 can be customized:
    cd "path" // navigate into the folder where the data is stored
    use ZA6701_person_wid1_v4-0-0.dta // fill in the name of the data set in the version you are using
    append using ZA6701_person_wid3_v4-0-0
    append using … // optionally append the files of any further data collections you want to use for longitudinal analysis
    You can also restrict the data to the variables you want to analyze by using the command
    use varlist using ZA6701_person_wid1_v4-0-0.dta // replace ‘varlist’ with the list of variables you want to use for your analyses
    2. Person long format, ‘wide’ (one row per person over all data collections and one column for each data collection and variable). Append the data of the data collections you want to analyze using the procedure described in 1):
    cd "path"
    use ZA6701_person_wid1_v4-0-0.dta
    append using ZA6701_person_wid3_v4-0-0
    Use the -reshape- command in order to get the person-wide format:
    local varselect "varlist" // select list of variables that need to be converted from long to wide form
    rename (`varselect') =_ // add suffix
    reshape wide *_, i(pid) j(wid) // convert selected variables from long to wide form

    3. Family wide format, ‘wide’ (one row per family over all data collections and separate columns for variables per person and data collection).
    For analyses using the family wide format with Stata, use the -merge- command with the family identifier fid.
    cd "path" // navigate into the folder where the data is stored
    use varlist using ZA6701_family_wide_wid1_v4-0-0.dta // replace ‘varlist’ with the list of variables you want to use for your analyses (the list must include fid)
    merge 1:1 fid using ZA6701_family_wide_wid3_v4-0-0.dta, keepusing(varlist) // replace ‘varlist’ with the list of variables you want to use for your analyses
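The logic of a one-to-one merge on the family identifier can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the Stata implementation: the function `merge_one_to_one`, the variable names `inc_1` and `inc_3`, and all values are hypothetical; only the key `fid` comes from the text.

```python
def merge_one_to_one(left, right, key="fid"):
    """Combine two lists of row dicts that each hold at most one row per key."""
    def index(rows):
        by_key = {}
        for row in rows:
            if row[key] in by_key:
                # A 1:1 merge requires the key to be unique in each file.
                raise ValueError(f"{key}={row[key]} is not unique")
            by_key[row[key]] = row
        return by_key

    left_idx, right_idx = index(left), index(right)
    merged = []
    # Keep families that appear in either file; absent columns stay missing.
    for k in sorted(left_idx.keys() | right_idx.keys()):
        row = {key: k}
        row.update(left_idx.get(k, {}))
        row.update(right_idx.get(k, {}))
        merged.append(row)
    return merged

# Hypothetical family-level rows from two data collections.
f2f1 = [{"fid": 10, "inc_1": 2500}, {"fid": 11, "inc_1": 3100}]
f2f2 = [{"fid": 10, "inc_3": 2700}, {"fid": 12, "inc_3": 1900}]
families = merge_one_to_one(f2f1, f2f2)
print(families[0])  # {'fid': 10, 'inc_1': 2500, 'inc_3': 2700}
```

Because the columns already carry data collection-specific suffixes, no variable from the second file overwrites one from the first.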

    Matching data files in SPSS
    1. Person long format, ‘long’ (one row per data collection for each person and one column per variable). The following example for combining the two face-to-face data collections F2F1 and F2F2 can be customized:
    add files
    /file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid1_v4-0-0.sav'
    /file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid3_v4-0-0.sav'.
    save outfile = 'C:\...\SUF_4-0-0_beta_04052020\en_person_wid13_match.sav'.
    exe.
    2. Person long format and family wide format, ‘wide’ (person format: one row per person and one column per variable and data collection; family format: one row per family over all data collections and separate columns for variables per person and data collection).
    If the combined data needs to be in wide format, all variables except the matching variables must carry a data collection-specific suffix in every data set. In the person format, this suffix has to be created before matching for all variables except pid. In the family format, wave-specific suffixes are already provided (except for the variables wav0100, cgr and zyg0102, which are time-stable and therefore identical in all waves). Variable suffixes can easily be created using the Python plugin. The following code can be customized to do this:

    begin program.
    variables = 'all' # variables that should get a suffix, e.g. 'all', 'x, y, z' or 'x to y'
    suffix = '_1' # enter the chosen suffix
    import spss, spssaux
    oldnames = spssaux.VariableDict().expand(variables)
    # leave the matching variables unchanged so they can still be used in /by
    oldnames = [varnam for varnam in oldnames if varnam.lower() not in ('pid', 'fid')]
    newnames = [varnam + suffix for varnam in oldnames]
    spss.Submit('rename variables(%s=%s).'%('\n'.join(oldnames),'\n'.join(newnames)))
    end program.
    Once each data set has data collection-specific suffixes, all data sets must be sorted by the matching variable: data sets in person format by pid, data sets in family format by fid (see chapter 3.4). To finally combine two data sets, the following code can be customized:
    sort cases by pid.
    match files
    /file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid1_v4-0-0.sav'
    /file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid2_v4-0-0.sav'
    /by pid.
    save outfile= 'C:\...\SUF_4-0-0_beta_04052020\en_person_wid12_match.sav'.
    exe.
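Why the suffixes are needed before matching can be shown with a small stand-alone Python sketch. It does not use the SPSS plugin; the helper functions `add_suffix` and `colliding_names`, the variable name `height`, and all values are hypothetical and serve only as an illustration.

```python
def add_suffix(row, suffix, keys=("pid",)):
    """Return a copy of the row with the suffix added to every non-key name."""
    return {(name if name in keys else name + suffix): value
            for name, value in row.items()}

def colliding_names(row_a, row_b, keys=("pid",)):
    """List the non-key variable names that occur in both rows."""
    return sorted((row_a.keys() & row_b.keys()) - set(keys))

# The same person in two data collections, with identical variable names.
first = {"pid": 1, "height": 150}
second = {"pid": 1, "height": 155}
print(colliding_names(first, second))        # ['height']

# After adding data collection-specific suffixes, the names are distinct.
first_s = add_suffix(first, "_1")
second_s = add_suffix(second, "_2")
print(colliding_names(first_s, second_s))    # []

# Both values now fit into one row per person.
combined = {**first_s, **second_s}
print(combined)  # {'pid': 1, 'height_1': 150, 'height_2': 155}
```

Without the suffixes, a single column named `height` could not hold the values of both data collections in one row.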