A collection of tools for processing O*Net task data using LLM queries and integrating tasks into hierarchical ontology structures.
This repository contains template code for analyzing and processing ontology task data through:
- LLM-based row processing: Apply custom AI queries to each row of CSV data
- Hierarchy integration: Map classified tasks into JSON ontology structures
- `0926_onet_classifications.csv` - O*Net task data with classifications (38,223 rows). Contains: Task ID, Task description, Verb, Object, Normalized form, Verb Path, Synset, Object_Singularized
- `0926_hierarchy.json` - Hierarchical ontology structure for task integration
- `process_rows.py` - Template for LLM-based CSV row processing
- `map_to_json.py` - Integrates classified tasks into the hierarchy structure
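Once the dependencies below are installed, a quick pandas check (a sketch, not part of the repo) confirms the row count and columns:

```python
import pandas as pd

# Load the classification data and inspect its shape.
df = pd.read_csv('0926_onet_classifications.csv')
print(len(df))              # expect 38,223 rows
print(df.columns.tolist())  # column names as listed above
```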
Install the required packages:

```bash
pip install pandas tqdm python-dotenv openai
```

Set up your OpenAI API key:
echo "OPENAI_API_KEY=your_key_here" > .envThis script processes each row of your CSV with customizable LLM queries.
`process_rows.py` processes each row of your CSV with customizable LLM queries. Edit the `main()` function to set your parameters:
```python
config = {
    'input_file': '0926_onet_classifications.csv',  # Your CSV file
    'output_column_name': 'LLM_Analysis',           # Name for the new column
    'llm_model': 'gpt-4o',                          # OpenAI model
    'batch_size': 1,                                # Save frequency
    'delay_between_calls': 1.0,                     # API rate limiting (seconds)
    'max_retries': 3                                # Error handling
}
```

This is the key step! Edit the `create_llm_prompt()` method for your specific use case.
The method receives a pandas row and should return a dictionary with 'system' and 'user' prompts:
```python
def create_llm_prompt(self, row: pd.Series, row_index: int) -> Dict[str, str]:
    # Access any column from your CSV
    task = row.get('Task', '')
    verb = row.get('Verb', '')
    obj = row.get('Object', '')

    # Create your custom prompts
    system_prompt = '''Your system instruction here...'''
    user_prompt = f"Your user prompt using {task}, {verb}, {obj}..."

    return {'system': system_prompt, 'user': user_prompt}
```

Some example prompt pairs:

Task Classification:
```python
system_prompt = '''Classify tasks into: Physical, Information, Communication, Creative'''
user_prompt = f"Classify this task: {task}"
```

Quality Assessment:
```python
system_prompt = '''Rate verb-object alignment on a 1-10 scale'''
user_prompt = f'Task: {task}\nVerb: {verb}\nObject: {obj}\nHow well does "{verb} {obj}" represent this task?'
```

Information Extraction:
```python
system_prompt = '''Extract the primary tool or equipment mentioned in the task'''
user_prompt = f"What tool/equipment is used in: {task}"
```

Run the script:

```bash
python process_rows.py
```

The script will (see the sketch after this list):
- ✅ Create automatic backups
- 📊 Show progress with live updates
- 💾 Save after each row (resumable)
- 🔄 Handle errors with retries
- 📈 Provide completion statistics
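A condensed sketch of that per-row loop (hypothetical structure and names such as `process_all_rows` and `processor`; the actual template may differ): iterate with `tqdm`, skip rows that already have output, call the chat API with the prompts from `create_llm_prompt()`, and save every `batch_size` rows.

```python
import time

import pandas as pd
from tqdm import tqdm

def process_all_rows(processor, config):
    """Hypothetical driver loop; `processor` holds an OpenAI client and create_llm_prompt()."""
    df = pd.read_csv(config['input_file'])
    col = config['output_column_name']
    if col not in df.columns:
        df[col] = ''  # add the output column on first run

    for i, row in tqdm(df.iterrows(), total=len(df), desc="Processing"):
        if isinstance(df.at[i, col], str) and df.at[i, col]:
            continue  # already processed -> makes restarts resumable
        prompts = processor.create_llm_prompt(row, i)
        response = processor.client.chat.completions.create(
            model=config['llm_model'],
            messages=[
                {'role': 'system', 'content': prompts['system']},
                {'role': 'user', 'content': prompts['user']},
            ],
        )
        df.at[i, col] = response.choices[0].message.content
        if (i + 1) % config['batch_size'] == 0:
            df.to_csv(config['input_file'], index=False)  # periodic save
        time.sleep(config['delay_between_calls'])  # simple rate limiting

    df.to_csv(config['input_file'], index=False)  # final save
```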
`map_to_json.py` integrates your classified task data into the ontology hierarchy. Edit the `main()` function:
```python
config = {
    'input_csv_file': '0926_onet_classifications.csv',
    'hierarchy_file': '0926_hierarchy.json',
    'output_file': '0926_hierarchy_with_onet.json'
}
```

Then run:

```bash
python map_to_json.py
```

This will:
- 🔄 Map tasks to hierarchy nodes based on synsets (see the sketch after this list)
- 📁 Create structured "(Atomic Tasks)" and "(Specializations)" sections
- 📊 Generate detailed integration report
- 💾 Output enhanced hierarchy JSON
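A simplified sketch of that synset-based mapping, assuming (hypothetically) that hierarchy nodes carry a `synset` key and nest children under a `children` list; the real JSON layout and the script's logic may differ:

```python
import json

import pandas as pd

def map_tasks_to_hierarchy(csv_path, hierarchy_path, output_path):
    """Attach task lists to hierarchy nodes whose synset matches the CSV's Synset column."""
    df = pd.read_csv(csv_path)
    with open(hierarchy_path) as f:
        hierarchy = json.load(f)

    # Group task descriptions by synset for O(1) lookup during traversal.
    tasks_by_synset = df.groupby('Synset')['Task'].apply(list).to_dict()

    def visit(node):
        if node.get('synset') in tasks_by_synset:  # hypothetical key
            node['(Atomic Tasks)'] = tasks_by_synset[node['synset']]
        for child in node.get('children', []):     # hypothetical key
            visit(child)

    visit(hierarchy)
    with open(output_path, 'w') as f:
        json.dump(hierarchy, f, indent=2)
```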
- Live Progress Tracking: Visual progress bar with current status
- Automatic Backups: Timestamped backups before processing
- Resume Capability: Skip already processed rows on restart
- Error Handling: Exponential backoff retry logic for API failures (sketched after this list)
- Flexible Configuration: Easy customization for different use cases
- Debug Output: Detailed logging of each processing step
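The backoff behavior might look like this sketch (a hypothetical helper, not the template's exact code): wait 1s, 2s, 4s, ... between attempts, then re-raise once retries are exhausted.

```python
import time

def call_with_retries(make_request, max_retries=3, base_delay=1.0):
    """Retry a callable, doubling the wait after each failure (exponential backoff)."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```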
Process specific columns by modifying the prompt creation:
```python
# Multi-column analysis
task_id = row.get('Task ID', '')
normalized = row.get('Normalized', '')
synset = row.get('Synset', '')

user_prompt = f"""
Task ID: {task_id}
Task: {task}
Normalized: {normalized}
Synset: {synset}
Analyze the consistency between these fields...
"""
```

For large datasets, adjust batch processing:
```python
config = {
    'batch_size': 10,            # Save every 10 rows
    'delay_between_calls': 0.5,  # Faster processing
}
```

If processing fails, simply restart the script - it will resume from where it left off by checking for empty cells in your output column.
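That resume check can be as simple as this sketch (assuming pandas and the column name from the earlier config):

```python
import pandas as pd

df = pd.read_csv('0926_onet_classifications.csv')
col = 'LLM_Analysis'

# A row still needs processing if its output cell is missing or blank.
pending = df[col].isna() | (df[col].astype(str).str.strip() == '')
print(f"{int(pending.sum())} of {len(df)} rows remaining")
```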
`process_rows.py`:
- Updates your CSV with a new column containing LLM responses
- Creates timestamped backup files
- Provides processing statistics and error counts

`map_to_json.py`:
- Enhanced hierarchy JSON with integrated tasks
- `ONet_Integration_Report.md` with detailed analysis
- Statistics on integration coverage and missing synsets
- API Errors: Check your OpenAI API key and rate limits
- File Not Found: Verify file paths in configuration
- Memory Issues: Reduce batch_size for large files
- Resume Issues: Delete the problematic output column to restart fresh (see the sketch below)
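One way to drop that column and start over (a sketch; keep a backup of the CSV first):

```python
import pandas as pd

df = pd.read_csv('0926_onet_classifications.csv')
df = df.drop(columns=['LLM_Analysis'])  # discard the partial results column
df.to_csv('0926_onet_classifications.csv', index=False)
```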