by Aysu Avcı and Melih Damar
- The Panel Study Labour Market and Social Security (PASS) is a household panel dataset established in 2007 by the Institute for Employment Research (IAB) in Germany.
- The dataset contains information at the household and individual levels.
- Households are identified via `hnr` and `wave`.
- Individuals are identified via `pnr` and `wave`.
- Access to the PASS main dataset is only possible via an application to the Research Data Center (FDZ).
- The PASS Campus File (PASS-CF) is a simplified version of the main dataset that is suitable for academic teaching and for gaining first insights into the handling of PASS data.
- Compared to the main dataset, PASS-CF contains fewer observations, a reduced range of variables, and modified identification numbers and information; it is therefore not suitable for substantive scientific analysis.
- The purpose of this project is to create a PASS-CF data preparation repository that can be a template and a starting point for a similar repository for the main PASS data set.
- We also aim to familiarize ourselves with the effective use of programming in cleaning panel data sets and performing initial analysis.
- The PASS-CF dataset is accessible after filling in the form at the following link: https://fdz.iab.de/en/campus-files/pass_cf/registrierungsformular-zum-download-des-campus-files-pass-0617-v1.aspx
- The longitudinal PASS-CF datasets `HHENDDAT_cf_W11.dta`, `PENDDAT_cf_W11.dta`, `hweights_cf_W11.dta` and `pweights_cf_W11.dta` are used in this project. Therefore, please add these data files to the folder `src/original_data/` in your local copy of the repository.
- Please make sure your conda environment is up to date. The basic requirements can be found in the `environment.yml` file.
- Activate the project environment by running `conda activate pass_data_preparation`.
- Run `conda develop .`.
- Run `pytask`.
This resource can be helpful to get an understanding of pytask: https://pytask-dev.readthedocs.io/en/latest/index.html
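For orientation, here is a minimal sketch of what a task definition could look like. It uses the decorator-based `depends_on`/`produces` interface of older pytask versions; the `SRC`/`BLD` path constants and the task name are assumptions based on the econ-project-templates layout, not necessarily the repository's exact setup.

```python
# Hypothetical minimal pytask task: read one raw .dta file and store a cleaned
# .pickle file. File names mirror the datasets listed above; the config import
# is an assumption based on the econ-project-templates layout.
import pandas as pd
import pytask

from src.config import BLD, SRC


@pytask.mark.depends_on(SRC / "original_data" / "HHENDDAT_cf_W11.dta")
@pytask.mark.produces(BLD / "cleaned_data" / "HHENDDAT_clean.pickle")
def task_basic_cleaning_sketch(depends_on, produces):
    data = pd.read_stata(depends_on)
    # ... renaming, negative-to-NaN conversion and indexing would happen here ...
    data.to_pickle(produces)
```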
- `src/original_data/` should contain the four datasets added to the folder by the user.
- `src/data_management/` contains all the files related to the cleaning process. For each `data_set` there should be a `{data_set}_renaming.csv` in `src/data_management/`; as mentioned above, these renaming documents sit under each `data_set/` folder. The functions used for the cleaning steps can be found in the file `cleaning_functions.py`. The creation of dummy variables requires a list of variables that is passed to the `create_dummies()` function; therefore, in `dummies/`, each `data_set` that requires such an operation should have a `{data_set}_dummies.yaml`. The tests written for the cleaning functions are in the file `test_cleaning.py`. Finally, the cleaning task itself can be found in `task_cleaning.py`, which creates the new datasets in three steps.
- After running pytask, the final data sets `PENDDAT_aggregated.pickle` and `HHENDDAT_aggregated.pickle` are created under `bld/`, as well as a merged alternative of the datasets, `merged_clean.pickle`.
- `src/final/` contains `task_stat.py`, the task needed to form the summary statistics.
- Other tasks include `task_documentation.py` and `task_paper.py`, which create `research_project.pdf` based on `research_paper.tex` and `{data_set}_sum_stat.tex`.
The repository only contains scripts. The raw files need to be provided manually in the `src/original_data` folder, and all output files are produced by running pytask; they can then be found under `bld`.
See https://econ-project-templates.readthedocs.io/en/stable/ for more information on the template that is used.
This repository is inspired by the SOEP data preparation repository of the Institute of Labor Economics (IZA).
The LISS data management documentation, which was created with a similar structure, might also be helpful.
The pipeline performs the following steps for both the household- and the individual-level datasets.
- Collect the respective .dta file.
- Rename all variables according to the respective renaming .csv file.
- Perform basic data cleaning.
- Reverse-code variables and aggregate them.
- Create dummies that might come in useful.
- Merge the datasets.
- Save the final data sets as .pickle.
- Report some summary statistics and create the research paper in pdf format.
All the data cleaning steps (steps 1 to 7) are specified in `src/data_management/task_cleaning.py`.
The detailed information about all of the steps can be found below.
- For each `data_set` there should be a `{data_set}_renaming.csv` in `src/data_management/`. The `{data_set}_renaming.csv` files with an empty new-variable-name column are created using the `create_renaming_file()` function, which can be found in `src/sandbox/create_renaming_file.ipynb`.
- The renaming files are ";"-separated .csv files and specify the new name for each variable (a sketch of applying such a file follows after this list).
- Since the respective .csv files contain all the variables of a dataset together with their new names, they can also serve as documentation for viewing all the variables.
- The general information about the original naming of the datasets can be found in Table 21 of the PASS User Guide which can be downloaded via the following link: [https://doku.iab.de/fdz/pass/FDZ-Datenreporte_PASS_EN.zip].
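A minimal pandas sketch of applying such a renaming file is shown below. The column headers `pass_name` and `new_name`, and the exact path of the .csv file, are assumptions for illustration rather than the repository's documented layout.

```python
# Apply a ";"-separated renaming file to a raw dataset. The .csv column headers
# ("pass_name", "new_name") and the file paths are illustrative assumptions.
import pandas as pd

renaming = pd.read_csv("src/data_management/HHENDDAT_renaming.csv", sep=";")
rename_map = dict(zip(renaming["pass_name"], renaming["new_name"]))

household = pd.read_stata("src/original_data/HHENDDAT_cf_W11.dta")
household = household.rename(columns=rename_map)
```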
Some standardizations we use in renaming:
- Use of English
- A common naming scheme for the variables in the same module (e.g. `big_5`).
- All the negatively phrased variables* end with `_n`.
*Referring to the items in a scale that differ in direction from most other items in that scale.
- As the basic cleaning step, we convert all values coded as negative to NaN (e.g. "I don't know" -> `np.nan`).
- Then, we set indices for both data sets.
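A minimal sketch of these two steps, assuming the data are already loaded into pandas DataFrames named `individual` and `household` and that the identifier columns carry the names given above:

```python
# Convert all negative codes (missing-value codes such as "I don't know") to NaN,
# touching only numeric columns, and set the panel identifiers as the index.
import numpy as np


def negatives_to_nan(data):
    numeric_cols = data.select_dtypes(include="number").columns
    data[numeric_cols] = data[numeric_cols].mask(data[numeric_cols] < 0, np.nan)
    return data


individual = negatives_to_nan(individual).set_index(["pnr", "wave"])
household = negatives_to_nan(household).set_index(["hnr", "wave"])
```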
- New variables are created according to the PASS Scale and Instrument Manual.
- As with the deprivation module in the household-level data, some variables are already aggregated in the data. We extend this practice to the following modules in the individual-level data (a sketch follows after the list):
- Big Five
- Effort-Reward Imbalance Scale (ERI Scale)
- Gender Role Attitudes
- All the negatively phrased variables are inverted before the aggregation.
- All the newly created variables are named according to module name.
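As an illustration of the reverse coding and aggregation, here is a sketch for a Big Five style module. The item names and the 1-5 scale are illustrative assumptions; the actual items and scale ranges come from the PASS Scale and Instrument Manual.

```python
# Invert a negatively phrased item (suffix "_n") on an assumed 1-5 scale and
# aggregate the module into a single score; the item names are hypothetical.
scale_max = 5
individual["big_5_calm_n"] = (scale_max + 1) - individual["big_5_calm_n"]

big_5_items = ["big_5_outgoing", "big_5_worried", "big_5_calm_n"]
individual["big_5"] = individual[big_5_items].mean(axis=1)
```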
- All the variables we use to create dummies are specified in `src/data_management/dummies/{data_set}_dummies.yaml`.
- Dummy variables are created without changing the original variables or values.
- For convenience, we name dummy variables following the structure `{original_variable_name}_dummy` (a sketch follows after this list).
- In the PASS-CF dataset, questions with two possible answers were not coded as dummy variables but as variables with the values 1 and 2 (e.g. Yes = 1, No = 2). Therefore, we create dummy variables for the following types of items:
- Yes/No questions (e.g. social media usage in the last 4 weeks)
- Categorical questions with two possible answers (e.g. gender).
- On top of these variables, we also create dummies for:
  - `PG0100`, a numeric variable that ranges between 0 and 99 and indicates the number of doctor visits in the last 3 months.
  - Financial reason dummies for the Deprivation Module. In this module, individuals are asked about owning certain goods or engaging in certain activities. If a household answers no to an item, it is asked whether this is due to financial or other reasons. We therefore create dummies where the value 1 corresponds to not owning a good or not engaging in an activity for financial reasons (e.g., no car for financial reasons).
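A sketch of this dummy creation is shown below. The helper is illustrative and not necessarily the repository's `create_dummies()`; the 1/2 coding and the `_dummy` suffix follow the description above, while the variable name `social_media` and treating any positive `PG0100` value as 1 are assumptions.

```python
import numpy as np


def yes_no_to_dummy(series):
    """Map the PASS coding 1 (yes) -> 1 and 2 (no) -> 0; other values become NaN."""
    return series.map({1: 1.0, 2: 0.0})


# New columns are added so the original variables and values stay untouched.
individual["social_media_dummy"] = yes_no_to_dummy(individual["social_media"])
individual["PG0100_dummy"] = np.where(
    individual["PG0100"].isna(), np.nan, (individual["PG0100"] > 0).astype(float)
)
```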
- `task_cleaning.py` is divided into three steps; at the end of each step, a file with the processed datasets is created:
  - `task_basic_cleaning` performs renaming, basic cleaning and indexing for each `data_set` and returns `{data_set}_clean.pickle` to `bld/cleaned_data/`.
  - `task_aggregation_and_dummy` performs reverse coding and creates aggregated variables and dummy variables for `PENDDAT` and `HHENDDAT`; it returns `{data_set}_aggregated.pickle` to `bld/aggregated_data/`.
  - `task_merging` first merges the aggregated `PENDDAT` and `HHENDDAT` datasets with the cleaned weights datasets `hweights` and `pweights` and produces the two `{data_set}_weighted.pickle` files in `bld/weighted_data/`. Secondly, it merges these two weighted datasets and creates `merged_clean.pickle` under `bld/final_data/` (a sketch of the merging logic follows below).
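The sketch below follows the identifiers described above (`pnr`/`hnr` and `wave`) and assumes flat DataFrames that contain these columns, including `hnr` in the individual-level data; the join types are assumptions and the actual `task_merging` may differ.

```python
# Attach the weights, then combine the individual- and household-level data.
individual_weighted = individual.merge(pweights, on=["pnr", "wave"], how="left")
household_weighted = household.merge(hweights, on=["hnr", "wave"], how="left")

merged_clean = individual_weighted.merge(
    household_weighted, on=["hnr", "wave"], how="left", suffixes=("", "_hh")
)
merged_clean.to_pickle("bld/final_data/merged_clean.pickle")
```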
We did not delete any of the dataset files formed in the intermediate steps, so that researchers can work with their preferred dataset. However, we added lines of code at the end of `task_merging` that enable researchers to delete the intermediate datasets from `bld/`.
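A sketch of what such cleanup lines could look like; the folder names follow the `bld/` layout described above, and whether they stay commented out by default is left to the researcher.

```python
from pathlib import Path

# Remove the intermediate pickles once merged_clean.pickle has been written.
for folder in ["cleaned_data", "aggregated_data", "weighted_data"]:
    for intermediate in (Path("bld") / folder).glob("*.pickle"):
        intermediate.unlink()
```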