Extract Data from Files with AI
Use AI to find, extract, and transform data from files on your Mac. Grep, file tools, and AI analysis combined.
You have data scattered across files: emails in text dumps, numbers in CSVs, patterns in log files. AI combined with file tools can find, extract, and organize this data faster than manual processing.
The Quick Recipe
“Read ~/data/customers.csv and extract all email addresses. Create a clean list with one email per line and save to ~/data/emails.txt”
Chapeta reads the file, uses AI to identify the email addresses, and writes the clean output.
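To see what this kind of extraction amounts to, here is a minimal Python sketch of the same job. The regex is a common simplification of the email format, not the exact pattern Chapeta uses:

```python
import re

def extract_emails(text: str) -> list[str]:
    # Simplified email pattern; real-world address validation is more involved
    pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    # Keep first occurrence of each address, preserving order
    seen = []
    for match in re.findall(pattern, text):
        if match not in seen:
            seen.append(match)
    return seen

sample = "Contact ana@example.com or, for billing, bill@example.org (cc ana@example.com)."
print("\n".join(extract_emails(sample)))  # one email per line, duplicates removed
```

The advantage of asking the AI instead of writing this yourself is that it also handles addresses split across odd formatting that a single regex would miss.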
Data Extraction Patterns
From CSV Files
“Read ~/data/sales.csv. Calculate total revenue per region and present as a markdown table.”
“Read ~/data/employees.csv. Find all employees hired in 2025 and list their names and departments.”
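The per-region roll-up the first prompt asks for is a group-and-sum. A sketch with the standard library, assuming the file has `region` and `revenue` columns (the column names are assumptions about your data, not requirements of the tool):

```python
import csv
from collections import defaultdict
from io import StringIO

def revenue_by_region(csv_text: str) -> dict[str, float]:
    # Sum the revenue column, grouped by the region column
    totals = defaultdict(float)
    for row in csv.DictReader(StringIO(csv_text)):
        totals[row["region"]] += float(row["revenue"])
    return dict(totals)

sample = "region,revenue\nWest,100.0\nEast,250.5\nWest,50.0\n"
print(revenue_by_region(sample))  # {'West': 150.0, 'East': 250.5}
```

The AI then formats the resulting totals as the markdown table you asked for.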
From Log Files
“Search ~/logs/app.log for all ERROR entries. Group them by error type and show the count of each.”
“Read ~/logs/access.log. Extract all unique IP addresses and count how many requests each made.”
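The IP-counting prompt maps to a scan-and-tally. A sketch assuming a common log format where the client IP is the first whitespace-separated field on each line:

```python
from collections import Counter

def count_requests_by_ip(log_text: str) -> Counter:
    # Assumes the client IP is the first field (true for common/combined log formats)
    ips = (line.split()[0] for line in log_text.splitlines() if line.strip())
    return Counter(ips)

log = (
    '203.0.113.5 - - [01/Jan/2025] "GET / HTTP/1.1" 200\n'
    '198.51.100.7 - - [01/Jan/2025] "GET /a HTTP/1.1" 200\n'
    '203.0.113.5 - - [01/Jan/2025] "GET /b HTTP/1.1" 404\n'
)
for ip, n in count_requests_by_ip(log).most_common():
    print(f"{ip}\t{n}")
```

If your logs use a different layout, say so in the prompt; the AI adapts where a hard-coded script would not.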
From Text Documents
“Read ~/Documents/report.txt. Extract all monetary amounts mentioned and list them with their context.”
“Read ~/Documents/transcript.txt. Extract all names of people mentioned and what they discussed.”
From JSON/XML
“Read ~/data/api-response.json. Extract the name, email, and status for each user in the response.”
“Read ~/config/settings.xml. List all configuration keys and their current values.”
From Code Files
“Search ~/projects/myapp/src/ for all API endpoint definitions. List the URL, HTTP method, and handler function for each.”
“Search ~/projects/myapp/ for all environment variable references (process.env.*) and list each unique variable name.”
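The environment-variable search is a good fit for a regex over the source tree. A sketch of the pattern, applied here to an in-memory string (a real run would walk the files under the project directory):

```python
import re

# Matches process.env.SOME_NAME and captures the variable name
ENV_REF = re.compile(r"process\.env\.([A-Za-z_][A-Za-z0-9_]*)")

def env_vars_in(source: str) -> set[str]:
    return set(ENV_REF.findall(source))

code = """
const url = process.env.API_URL;
const key = process.env.API_KEY || process.env.API_URL;
"""
print(sorted(env_vars_in(code)))  # ['API_KEY', 'API_URL']
```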
Combining Grep + AI
The Grep tool finds patterns. The AI interprets results.
Pattern Finding
“Grep for phone numbers (pattern: xxx-xxx-xxxx) across all files in ~/Documents/contacts/ and compile a clean phone list”
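The xxx-xxx-xxxx pattern in that prompt corresponds to a short regex. A sketch (this matches only that one US-style format; other formats would need their own patterns or a more flexible prompt):

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def extract_phones(text: str) -> list[str]:
    # Unique numbers, in order of first appearance
    return list(dict.fromkeys(PHONE.findall(text)))

notes = "Call Dana at 555-867-5309; fax 555-123-4567. Dana again: 555-867-5309."
print("\n".join(extract_phones(notes)))
```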
Log Analysis
“Grep for ‘failed’ in ~/logs/ across all log files. For each match, read the surrounding 5 lines for context and summarize what failed and why.”
Code Analysis
“Grep for ‘TODO’ and ‘FIXME’ comments across ~/projects/myapp/src/. Categorize each by urgency and create a prioritized task list.”
Transform and Output
Extraction is only useful if the output is usable:
CSV to Markdown
“Read ~/data/comparison.csv and convert it to a well-formatted markdown table”
Unstructured to JSON
“Read ~/Documents/notes.txt. Extract all action items mentioned and output them as a JSON array with fields: task, owner, deadline”
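The JSON array that prompt asks for would look like the output below. A sketch that assembles it; the sample items and field values are illustrative, not taken from a real notes.txt:

```python
import json

# Illustrative action items, as the AI might extract them from unstructured notes
items = [
    {"task": "Send revised budget", "owner": "Maria", "deadline": "2025-06-01"},
    {"task": "Book venue", "owner": "Tom", "deadline": None},  # no deadline mentioned
]
print(json.dumps(items, indent=2))
```

Specifying the field names in the prompt (task, owner, deadline) is what guarantees every object in the array has the same shape.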
Multiple Files to Summary
“Read all .md files in ~/Documents/project-notes/. Create a single summary document combining the key points from each. Save to ~/Documents/project-summary.md”
Data Cleaning
“Read ~/data/messy-emails.txt. Clean the list: remove duplicates, fix obvious typos in email formats, remove invalid entries. Save the clean list to ~/data/clean-emails.txt”
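The cleaning steps in that prompt (deduplicate, drop invalid entries, fix format glitches) can be sketched like this. The fixes shown, trailing punctuation and stray whitespace, are examples of common copy-paste debris, not an exhaustive list of what the AI repairs:

```python
import re

VALID = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def clean_emails(lines: list[str]) -> list[str]:
    cleaned = []
    for raw in lines:
        # Normalize case and strip common copy-paste debris
        email = raw.strip().lower().rstrip(".,;")
        if VALID.match(email) and email not in cleaned:
            cleaned.append(email)
    return cleaned

messy = ["Ana@Example.com ", "ana@example.com.", "not-an-email", "bob@site.org"]
print("\n".join(clean_emails(messy)))
```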
Batch Processing
To process multiple files in one request:
“Find all .csv files in ~/data/monthly-reports/ using Glob. For each file, read it and extract the total revenue figure. Create a summary table with filename and revenue.”
This chains Glob (find files) + File Read (process each) + AI (extract data) + File Write (save results).
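The same chain can be sketched in Python. This version assumes each monthly report has a `total_revenue` column to sum; the folder layout and column name are assumptions about your data:

```python
import csv
import glob
import os

def summarize_reports(folder: str) -> list[tuple[str, float]]:
    """Glob for CSVs, read each, and return (filename, total revenue) pairs."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="") as f:
            total = sum(float(r["total_revenue"]) for r in csv.DictReader(f))
        rows.append((os.path.basename(path), total))
    return rows
```

For example, `summarize_reports(os.path.expanduser("~/data/monthly-reports"))` would return one pair per report file, ready to format as a summary table.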
Tips
- Specify the output format: “as a CSV”, “as a markdown table”, or “as a JSON array” gets you structured, usable output
- Be explicit about what to extract: “Extract email addresses” is better than “find important data”
- Use Grep for known patterns: If you know the exact pattern (email format, phone number format), Grep is faster than reading entire files
- Chain operations: Find files, read them, extract data, save results, all in one conversation
Limitations
File read works on text-based files. Binary formats (Excel .xlsx, database files, compressed archives) need to be converted to text first. Very large files may need to be processed in chunks. The AI’s data extraction relies on pattern recognition, so unusual formats may need more specific instructions. For numerical analysis beyond basic calculations, consider exporting to a proper data tool after extraction.