91 changes: 91 additions & 0 deletions statvar_imports/ipeds/college_admission_national/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
# IPEDS Admissions and Enrollment National

## Import Overview

This project processes and imports national-level applications, admissions, and enrollment data from the Integrated Postsecondary Education Data System (IPEDS).

* **Import Name**: College_Admissions_IPEDS_National
* **Source URL**: https://nces.ed.gov/ipeds/datacenter/
* **Provenance Description**: Data on student applications, admissions, and enrollment for postsecondary institutions across the United States.
* **Import Type**: Automated
* **Source Data Availability**: Data covers academic years from 2014-15 to 2023-24.
* **Release Frequency**: Annual

---

## Preprocessing Steps

The import process involves downloading raw data, preprocessing it to remove descriptive rows, and then generating the final artifacts for ingestion.

* **Input files**:
    * `download_config.json`: Configuration file for `download_script.py`.
    * `download_script.py`: Script to download raw data from IPEDS and convert XLSX to CSV.
    * `input_files/college_admissions_YYYY.csv`: The raw data file for a given year (e.g., `college_admissions_2014.csv`), generated by `download_script.py`.
    * `college_admissions_ipeds_metadata.csv`: Configuration file for the data processing script.
    * `pv_map/college_admissions_ipeds_pv_map_YYYY.csv`: Property-value mapping file for a given year.
    * `admissions_stat_vars_common.mcf`: Common Statistical Variable definitions.

* **Transformation pipeline**:
    1. `download_script.py` downloads the raw data from the source, converts it from XLSX to CSV, drops the descriptive rows above the first row containing `4-year`, and writes the cleaned CSVs to the `input_files/` directory.
    2. The `stat_var_processor.py` tool is run once for each year's data file, as specified in `manifest.json`.
    3. The processor uses `college_admissions_ipeds_metadata.csv`, the corresponding year's `pv_map/*.csv`, and `admissions_stat_vars_common.mcf` to generate the final artifacts.
    4. The output files (`.csv`, `.tmcf`, and `.mcf`) are placed in the `output_files/` directory.

* **Data Quality Checks**:
    * Linting is performed on the generated output files using the Data Commons import tool.
    * The `dc_generated/report.json` file contains a summary of validation checks, including warnings about year-over-year data fluctuations.

---

## Autorefresh

This import is considered automated due to the inclusion of `download_script.py` in the pipeline.

* **Steps**:
    1. Execute `download_script.py` to fetch the raw data files into `input_files/`.
    2. Run the `stat_var_processor.py` tool (as defined in `manifest.json`) on the preprocessed files to generate the final artifacts for ingestion.
    3. Each processor run reads its year's PV map, so a corresponding `college_admissions_ipeds_pv_map_YYYY.csv` file must be available in the `pv_map/` directory for each year.
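
The PV map requirement can be checked before a refresh. The following is a minimal sketch, not part of this import: the helper name `missing_pv_maps` is hypothetical, and it assumes the `input_files/` and `pv_map/` naming patterns described in this README.

```python
from pathlib import Path


def missing_pv_maps(input_dir, pv_map_dir):
    """Return the PV map file names that are missing for the input CSVs.

    Hypothetical helper: assumes inputs are named
    college_admissions_YYYY.csv and PV maps are named
    college_admissions_ipeds_pv_map_YYYY.csv.
    """
    missing = []
    for csv_path in sorted(Path(input_dir).glob("college_admissions_*.csv")):
        # The year is the last underscore-separated token of the file stem.
        year = csv_path.stem.rsplit("_", 1)[-1]
        pv_map = Path(pv_map_dir) / f"college_admissions_ipeds_pv_map_{year}.csv"
        if not pv_map.is_file():
            missing.append(pv_map.name)
    return missing


if __name__ == "__main__":
    for name in missing_pv_maps("input_files", "pv_map"):
        print(f"Missing PV map: {name}")
```

Running it from the import directory prints one line per missing PV map and nothing when every input file has one.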

---

## Script Execution Details

To run the import manually, follow these steps.

### Step 1: Download and Preprocess Raw Data (via `download_script.py`)

This script downloads the raw data from the IPEDS website, converts it from XLSX to CSV, and places the cleaned data in the `input_files/` directory. It uses `download_config.json` for URL and filename information.

**Usage**:

```shell
python3 download_script.py
```
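
The cleaning rule in `download_script.py` keeps everything from the first row that contains `4-year` and, if no such row exists, keeps the file unchanged. The sketch below illustrates that rule on plain lists rather than a pandas DataFrame; the function name `trim_to_first_data_row` and the sample rows are illustrative assumptions, not real IPEDS data.

```python
import csv
import io


def trim_to_first_data_row(rows, marker="4-year"):
    """Return rows starting at the first row with a cell containing marker.

    If no row matches, the rows are returned unchanged, which mirrors the
    script's "save the file as is" fallback.
    """
    for index, row in enumerate(rows):
        if any(marker in str(cell) for cell in row):
            return rows[index:]
    return rows


# Hypothetical sheet contents (not real IPEDS values).
sample_rows = [
    ["Table title (illustrative)", "", ""],
    ["Descriptive note (illustrative)", "", ""],
    ["4-year", "Total", "Men"],
    ["Applications", "100", "40"],
]

trimmed = trim_to_first_data_row(sample_rows)

# Write the trimmed rows as CSV text, without an extra header row.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(trimmed)
csv_text = buffer.getvalue()
print(csv_text)
```

Here the two descriptive rows are dropped and the CSV text starts at the `4-year` row.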

---

### Step 2: Process the Data for Final Output

This step runs `stat_var_processor.py` once for each input file, as specified in `manifest.json`. An example command for the 2014 data is shown below:

**Usage**:

```shell
python3 ../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=admissions_stat_vars_common.mcf --input_data=input_files/college_admissions_2014.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2014.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2014
```

_Note: This command needs to be executed for all input files as defined in `manifest.json`._
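
Since the same command is repeated for each year from 2014 to 2023, it can be generated rather than typed ten times. This is a rough sketch under the assumption that the only parts that vary are the year in the input, PV map, and output paths; the names `build_command` and `YEARS` are hypothetical.

```python
import shlex

PROCESSOR = "../../tools/statvar_importer/stat_var_processor.py"
YEARS = range(2014, 2024)  # 2014 through 2023, assumed from the input file names


def build_command(year):
    """Build the stat_var_processor.py argument list for one year."""
    return [
        "python3",
        PROCESSOR,
        "--existing_statvar_mcf=admissions_stat_vars_common.mcf",
        f"--input_data=input_files/college_admissions_{year}.csv",
        f"--pv_map=pv_map/college_admissions_ipeds_pv_map_{year}.csv",
        "--config_file=college_admissions_ipeds_metadata.csv",
        f"--output_path=output_files/admissions_output_{year}",
    ]


commands = [build_command(year) for year in YEARS]
for command in commands:
    # Print the shell form; to execute instead, pass the list to
    # subprocess.run(command, check=True) from the import directory.
    print(shlex.join(command))
```

Printing the commands first makes it easy to review them before running anything.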

---

### Step 3: Validate the Output Files

This command validates the generated files for formatting and semantic consistency before ingestion.

**Usage**:

```shell
java -jar /path/to/datacommons-import-tool.jar lint -d 'output_files/'
```

This step ensures that the generated artifacts are ready for ingestion into Data Commons.
@@ -0,0 +1,42 @@
[
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=36421",
"filename": "college_admissions_2023"
},
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=36028",
"filename": "college_admissions_2022"
},
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=32477",
"filename": "college_admissions_2021"
},
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=30453",
"filename": "college_admissions_2020"
},
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=36153",
"filename": "college_admissions_2019"
},
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=25420",
"filename": "college_admissions_2018"
},
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=25042",
"filename": "college_admissions_2017"
},
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=25063",
"filename": "college_admissions_2016"
},
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=25117",
"filename": "college_admissions_2015"
},
{
"url": "https://nces.ed.gov/ipeds/search/downloadtablelibrarytable?tableId=12533",
"filename": "college_admissions_2014"
}
]
@@ -0,0 +1,98 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""This script downloads data from IPEDS and converts it from XLSX to CSV."""

import json
import os
import sys
import pandas as pd
from absl import app
from absl import flags
from absl import logging

# Allows the following module imports to work when running as a script
_SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(os.path.dirname(_SCRIPT_DIR)))

from util import download_util

FLAGS = flags.FLAGS

flags.DEFINE_string(
    'download_config_path',
    os.path.join(_SCRIPT_DIR, 'download_config.json'),
    'Path to the download configuration JSON file.')
flags.DEFINE_string('output_dir', os.path.join(_SCRIPT_DIR, 'input_files'),
                    'Directory to save the final CSV files.')


def download_and_convert_to_csv(url: str,
                                output_path: str,
                                xlsx_temp_path: str) -> None:
    """Downloads a file from a URL, converts from XLSX to CSV, and removes header rows."""
    logging.info(f'Downloading from {url}')
    download_util.download_file_from_url(url=url, output_file=xlsx_temp_path)

    if not os.path.exists(xlsx_temp_path):
        logging.error(f'Failed to download file from {url}')
        return

    logging.info(f'Converting {xlsx_temp_path} to CSV.')
    xls_df = pd.read_excel(xlsx_temp_path, header=None)

    start_row = 0
    found = False
    for i, row in xls_df.iterrows():
        if any('4-year' in str(cell) for cell in row):
            start_row = i
            found = True
            break

    if not found:
        logging.warning(f'"4-year" not found in {xlsx_temp_path}. Saving the file as is.')
        cleaned_df = xls_df
    else:
        cleaned_df = xls_df.iloc[start_row:]

    if cleaned_df.empty:
        logging.warning(f'Downloaded file from {url} is empty after cleaning.')
    else:
        cleaned_df.to_csv(output_path, index=False, header=False)
        logging.info(f'Successfully converted and saved to {output_path}')

    os.remove(xlsx_temp_path)
    logging.info(f'Removed temporary file: {xlsx_temp_path}')


def main(argv):
    """Main function to download and process admissions data."""
    del argv  # Unused

    if not os.path.exists(FLAGS.output_dir):
        os.makedirs(FLAGS.output_dir)

    with open(FLAGS.download_config_path, 'r') as f:
        configs = json.load(f)

    for config in configs:
        url = config['url']
        filename = config['filename']
        csv_output_path = os.path.join(FLAGS.output_dir, f'{filename}.csv')
        xlsx_temp_path = os.path.join(FLAGS.output_dir, f'{filename}.xlsx')

        download_and_convert_to_csv(url, csv_output_path, xlsx_temp_path)


if __name__ == '__main__':
    app.run(main)
41 changes: 41 additions & 0 deletions statvar_imports/ipeds/college_admission_national/manifest.json
@@ -0,0 +1,41 @@
{
"import_specifications": [
{
"import_name": "College_Admissions_IPEDS_National",
"curator_emails": [
"support@datacommons.org"
],
"provenance_url": "https://nces.ed.gov/ipeds/search/",
"provenance_description": "Data on student applications, admissions, and enrollment for postsecondary institutions across the United States.",
"scripts": [
"download_script.py",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2014.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2014.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2014",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2015.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2015.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2015",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2016.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2016.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2016",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2017.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2017.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2017",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2018.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2018.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2018",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2019.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2019.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2019",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2020.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2020.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2020",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2021.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2021.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2021",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2022.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2022.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2022",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/college_admissions_2023.csv --pv_map=pv_map/college_admissions_ipeds_pv_map_2023.csv --config_file=college_admissions_ipeds_metadata.csv --output_path=output_files/admissions_output_2023"
],
"import_inputs": [
{"template_mcf": "output_files/admissions_output_2014.tmcf", "cleaned_csv": "output_files/admissions_output_2014.csv"},
{"template_mcf": "output_files/admissions_output_2015.tmcf", "cleaned_csv": "output_files/admissions_output_2015.csv"},
{"template_mcf": "output_files/admissions_output_2016.tmcf", "cleaned_csv": "output_files/admissions_output_2016.csv"},
{"template_mcf": "output_files/admissions_output_2017.tmcf", "cleaned_csv": "output_files/admissions_output_2017.csv"},
{"template_mcf": "output_files/admissions_output_2018.tmcf", "cleaned_csv": "output_files/admissions_output_2018.csv"},
{"template_mcf": "output_files/admissions_output_2019.tmcf", "cleaned_csv": "output_files/admissions_output_2019.csv"},
{"template_mcf": "output_files/admissions_output_2020.tmcf", "cleaned_csv": "output_files/admissions_output_2020.csv"},
{"template_mcf": "output_files/admissions_output_2021.tmcf", "cleaned_csv": "output_files/admissions_output_2021.csv"},
{"template_mcf": "output_files/admissions_output_2022.tmcf", "cleaned_csv": "output_files/admissions_output_2022.csv"},
{"template_mcf": "output_files/admissions_output_2023.tmcf", "cleaned_csv": "output_files/admissions_output_2023.csv"}
],
"source_files": [
"input_files/college_admissions_*.csv"
],
"cron_schedule": "0 5 3,17 * *"
}
]
}
@@ -0,0 +1,19 @@
key,property,value,property2,value2,property3,value3,,,,
Control of institution and enrollment status,observationAbout,country/USA,statType,measuredValue,populationType,Person,observationDate,2014,,
Public,collegeOrGraduateSchoolEnrollment,EnrolledInPublicCollegeOrGraduateSchool,establishmentOwnership ,Public,#Header,establishmentOwnership ,,,,
Private nonprofit,collegeOrGraduateSchoolEnrollment,EnrolledInPrivateCollegeOrGraduateSchool,establishmentOwnership ,PrivatelyOwnedNotForProfit,#Header,establishmentOwnership ,,,,
Private for-profit,collegeOrGraduateSchoolEnrollment,EnrolledInPrivateCollegeOrGraduateSchool,establishmentOwnership ,PrivatelyOwnedForProfit,#Header,establishmentOwnership ,,,,
All institutions,establishmentOwnership ,"""""",#Header,establishmentOwnership ,collegeOrGraduateSchoolEnrollment,EnrolledInCollegeOrGraduateSchool,,,,
Applications,enrollmentLevel,Applied,,,,,,,,
Admissions,enrollmentLevel,Admitted,,,,,,,,
Enrollments,enrollmentStatus,FirstTimeEnrolled,,,,,,,,
Full-time,enrollmentLevel,EnrolledFullTime,,,,,,,,
Part-time,enrollmentLevel,EnrolledPartTime,,,,,,,,
4-year,populationType,Student,measuredProperty,count,value,{Number},collegeOrUniversityLevel,FourYear,,
2-year,populationType,Student,measuredProperty,count,value,{Number},collegeOrUniversityLevel,TwoYear,,
Less-than-2-year,populationType,Student,measuredProperty,count,value,{Number},educationalAttainment,LessThan2Year,collegeOrUniversityLevel,TwoYear
Total1 ,measuredProperty,count,value,{Number},,,,,,
Men,gender,Male,measuredProperty,count,value,{Number},,,,
Women,gender,Female,measuredProperty,count,value,{Number},,,,
Another gender2,gender,GenderUnknownOrNotStated,measuredProperty,count,value,{Number},,,,
Another gender,gender,GenderUnknownOrNotStated,measuredProperty,count,value,{Number},,,,
@@ -0,0 +1,19 @@
key,property,value,property2,value2,property3,value3,,,,
Control of institution and enrollment status,observationAbout,country/USA,statType,measuredValue,populationType,Person,observationDate,2015,,
Public,collegeOrGraduateSchoolEnrollment,EnrolledInPublicCollegeOrGraduateSchool,establishmentOwnership ,Public,#Header,establishmentOwnership ,,,,
Private nonprofit,collegeOrGraduateSchoolEnrollment,EnrolledInPrivateCollegeOrGraduateSchool,establishmentOwnership ,PrivatelyOwnedNotForProfit,#Header,establishmentOwnership ,,,,
Private for-profit,collegeOrGraduateSchoolEnrollment,EnrolledInPrivateCollegeOrGraduateSchool,establishmentOwnership ,PrivatelyOwnedForProfit,#Header,establishmentOwnership ,,,,
All institutions,establishmentOwnership ,"""""",#Header,establishmentOwnership ,collegeOrGraduateSchoolEnrollment,EnrolledInCollegeOrGraduateSchool,,,,
Applications,enrollmentLevel,Applied,,,,,,,,
Admissions,enrollmentLevel,Admitted,,,,,,,,
Enrollments,enrollmentStatus,FirstTimeEnrolled,,,,,,,,
Full-time,enrollmentLevel,EnrolledFullTime,,,,,,,,
Part-time,enrollmentLevel,EnrolledPartTime,,,,,,,,
4-year,populationType,Student,measuredProperty,count,value,{Number},collegeOrUniversityLevel,FourYear,,
2-year,populationType,Student,measuredProperty,count,value,{Number},collegeOrUniversityLevel,TwoYear,,
Less-than-2-year,populationType,Student,measuredProperty,count,value,{Number},educationalAttainment,LessThan2Year,collegeOrUniversityLevel,TwoYear
Total1 ,measuredProperty,count,value,{Number},,,,,,
Men,gender,Male,measuredProperty,count,value,{Number},,,,
Women,gender,Female,measuredProperty,count,value,{Number},,,,
Another gender2,gender,GenderUnknownOrNotStated,measuredProperty,count,value,{Number},,,,
Another gender,gender,GenderUnknownOrNotStated,measuredProperty,count,value,{Number},,,,