# Data Requirements for Strategic Ambiguity Empirics Pipeline
## Overview
The pipeline requires **two types of data files** from Pitchbook:
1. Company data files (`Company*.dat`)
2. Deal data files (`Deal*.dat`)
## Current Status
- ✅ **Company Data**: Found and processed (1,320,702 companies)
- ❌ **Deal Data**: **MISSING** - No `Deal*.dat` files found
## Required File Structure
### Data Directory Layout
```
empirics/
├── data/
│   ├── raw/                      # Place raw Pitchbook files here
│   │   ├── Company*.dat          # Company data (✅ exists)
│   │   └── Deal*.dat             # Deal data (❌ MISSING)
│   └── processed/                # Generated by pipeline
│       ├── company_master.csv
│       ├── deal_panel.csv
│       └── analysis_panel.csv
```
## Deal Data Requirements
### File Naming
- Files must match the pattern: `Deal*.dat`
- Examples: `Deal2023.dat`, `Deal2024.dat`, `DealData.dat`
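
As an illustration, candidate filenames can be checked against this pattern with Python's standard `fnmatch` module. This is only a sketch: the helper name `is_deal_file` is hypothetical, and treating the match as case-sensitive is an assumption about how the pipeline discovers files.

```python
from fnmatch import fnmatchcase


def is_deal_file(filename: str) -> bool:
    """Return True if the filename matches Deal*.dat (case-sensitive; an assumption)."""
    return fnmatchcase(filename, "Deal*.dat")


# Example filenames: the first two come from the list above, the rest are counterexamples.
candidates = ["Deal2023.dat", "DealData.dat", "deal2023.dat", "Company2021.dat"]
matches = [name for name in candidates if is_deal_file(name)]
print(matches)
```

Note that a combined file such as `Company2021_deal2023.dat` does not match, because the name must begin with `Deal`.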
### File Format
- **Delimiter**: Pipe (`|`) separated
- **Encoding**: UTF-8
- **Headers**: First row must contain column names
### Required Columns (Pitchbook Schema)
The Deal*.dat files must include at least these columns:
| Column Name | Description | Example |
|------------|-------------|---------|
| `CompanyID` | Unique company identifier | `100001-08` |
| `CompanyName` | Company name | `Anthropic` |
| `DealDate` | Date of deal | `2024-03-15` |
| `DealType` | Type of financing | `Early Stage VC` |
| `VCRound` | Round label | `Series A`, `Series B` |
| `DealSize` | Amount raised (USD) | `450000000` |
| `DealStatus` | Deal completion status | `Completed` |
| `Investors` | List of investors | `Menlo Ventures, Spark Capital` |
| `PostValuation` | Post-money valuation | `4500000000` |
### Sample Deal Data Header
```
CompanyID|CompanyName|DealNo|DealID|DealDate|DealSize|DealStatus|DealType|DealType2|DealType3|DealClass|VCRound|Investors|PostValuation|...
```
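
Before running the pipeline, a header line can be checked against the required columns listed above. The following is a minimal sketch, not the pipeline's own validation; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration.

```python
# Required columns copied from the table above.
REQUIRED_COLUMNS = [
    "CompanyID", "CompanyName", "DealDate", "DealType", "VCRound",
    "DealSize", "DealStatus", "Investors", "PostValuation",
]


def missing_columns(header_line: str) -> list:
    """Return the required columns that are absent from a pipe-delimited header line."""
    present = {name.strip() for name in header_line.rstrip("\r\n").split("|")}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# The sample header above, with the trailing "..." dropped.
sample_header = (
    "CompanyID|CompanyName|DealNo|DealID|DealDate|DealSize|DealStatus|"
    "DealType|DealType2|DealType3|DealClass|VCRound|Investors|PostValuation"
)
print(missing_columns(sample_header))  # prints [] because every required column is present
```

To check a real file, pass its first line, for example the result of `readline()` on the file opened with `encoding="utf-8"`.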
## How to Obtain Deal Data
1. **From Pitchbook Platform**:
   - Log in to Pitchbook
   - Navigate to Data Exports
   - Select "Deal Data" export
   - Choose pipe-delimited format
   - Include columns listed above
   - Export as `.dat` file
2. **Date Range Recommendations**:
   - **Series A**: 2021-01-01 to 2022-10-31
   - **Series B**: 2023-05-01 to 2025-10-31
   - (Pipeline will filter by these dates)
3. **Save Location**:
   - Place downloaded files in: `empirics/data/raw/`
   - Ensure filename starts with `Deal`
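
To see which deals fall inside the recommended date ranges, the windows can be expressed as a small lookup. This is a sketch under assumptions: the pipeline's actual filter is not shown in this document, inclusive boundaries are assumed, and `ROUND_WINDOWS` and `in_window` are hypothetical names.

```python
from datetime import date

# Date windows copied from the recommendations above (boundaries assumed inclusive).
ROUND_WINDOWS = {
    "Series A": (date(2021, 1, 1), date(2022, 10, 31)),
    "Series B": (date(2023, 5, 1), date(2025, 10, 31)),
}


def in_window(vc_round: str, deal_date: str) -> bool:
    """Return True if an ISO-formatted DealDate lies in the window for its VCRound."""
    window = ROUND_WINDOWS.get(vc_round)
    if window is None:
        return False  # rounds without a recommended window are excluded in this sketch
    start, end = window
    return start <= date.fromisoformat(deal_date) <= end


print(in_window("Series A", "2021-06-15"))  # True: inside the Series A window
print(in_window("Series A", "2023-01-01"))  # False: after the Series A window closes
```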
## What Happens Without Deal Data?
If Deal*.dat files are missing:
- Step 2 (Process Deal Data) creates empty deal panel
- Step 3 (Create Panel) produces empty analysis panel
- Step 4 (Run Analysis) cannot run (no observations)
- Step 5 (Create Deliverables) cannot generate figures
**The pipeline will complete but produce no results.**
## Verification
After adding `Deal*.dat` files to `empirics/data/raw/`, verify by running:
```bash
# Check files
ls -lh empirics/data/raw/Deal*.dat
# Run pipeline from step 2
python pipeline_xarray.py --from 2
# Or reset and run full pipeline
python pipeline_xarray.py --force
```
Expected output if successful:
```
[Step 2] Processing deal data...
- Deal2023.dat: 150000 rows, 95 columns
✅ Processed 8500 deals
```
## Alternative: Split Combined Files
If you have a combined `Company2021_deal2023.dat` file:
1. **Identify the data structure**:
   - Check whether the file contains deal records or only company records
   - Look for columns like `DealDate`, `DealSize`, `VCRound`
2. **Extract Deal Data**:
   ```python
   import pandas as pd

   # Read the combined file; dtype=str keeps IDs such as "100001-08" and amounts unchanged
   df = pd.read_csv('data/raw/Company2021_deal2023.dat', sep='|', dtype=str)

   # Keep the deal-specific columns and drop rows that have no deal date
   deal_cols = ['CompanyID', 'CompanyName', 'DealDate', 'DealSize',
                'DealType', 'VCRound', 'DealStatus', 'Investors', 'PostValuation']
   deal_df = df[deal_cols].dropna(subset=['DealDate'])

   # Write a pipe-delimited file whose name matches the Deal*.dat pattern
   deal_df.to_csv('data/raw/Deal2023.dat', sep='|', index=False)
   ```
3. **Verify the split worked**:
   ```bash
   python pipeline_xarray.py --from 2
   ```
## Troubleshooting
### "No Deal*.dat files found"
→ Ensure files are in `empirics/data/raw/` and start with `Deal`
### "Processed 0 deals"
→ Check that:
- File has correct pipe-delimited format
- Required columns are present (especially `DealDate`, `DealType`, `VCRound`)
- Data includes VC deals (not just M&A or other deal types)
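
One way to work through these checks is to count rows per `VCRound` value, which shows whether the delimiter is being parsed and whether the file contains any VC rounds. The sketch below uses only the standard library; `summarize_rounds` is a hypothetical helper, and the two sample rows are invented for illustration, not real Pitchbook records.

```python
import csv
import io
from collections import Counter


def summarize_rounds(stream) -> Counter:
    """Count rows per VCRound value in a pipe-delimited stream with a header row."""
    reader = csv.DictReader(stream, delimiter="|")
    if reader.fieldnames is None or "VCRound" not in reader.fieldnames:
        raise ValueError("VCRound column not found; check the delimiter and header row")
    # Blank or missing round labels are grouped under "<blank>".
    return Counter((row["VCRound"] or "").strip() or "<blank>" for row in reader)


# Invented sample rows, for illustration only.
sample = io.StringIO(
    "CompanyID|DealDate|DealType|VCRound\n"
    "100001-08|2021-03-15|Early Stage VC|Series A\n"
    "100002-17|2022-07-01|Merger/Acquisition|\n"
)
print(summarize_rounds(sample))
```

For a real file, pass `open(path, encoding="utf-8", newline="")` instead of the `StringIO` object. If every row lands under `<blank>`, the file likely contains no VC deals.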
### "KeyError: 'CompanyID' not in index"
→ The deal file is missing required columns. Check the schema above.
## Contact
For data access issues:
- Check Pitchbook subscription status
- Request access to Deal data exports
- Contact data provider for proper export format
---
**Last Updated**: 2025-10-26
**Pipeline Version**: 3.1_modular_xarray_lightweight