# Data Requirements for Strategic Ambiguity Empirics Pipeline

## Overview

The pipeline requires **two types of data files** from Pitchbook:

1. Company data files (`Company*.dat`)
2. Deal data files (`Deal*.dat`)

## Current Status

✅ **Company Data**: Found and processed (1,320,702 companies)

❌ **Deal Data**: **MISSING** - No `Deal*.dat` files found

## Required File Structure

### Data Directory Layout

```
empirics/
├── data/
│   ├── raw/                  # Place raw Pitchbook files here
│   │   ├── Company*.dat      # Company data (✅ exists)
│   │   └── Deal*.dat         # Deal data (❌ MISSING)
│   └── processed/            # Generated by pipeline
│       ├── company_master.csv
│       ├── deal_panel.csv
│       └── analysis_panel.csv
```

## Deal Data Requirements

### File Naming

- Files must match the pattern: `Deal*.dat`
- Examples: `Deal2023.dat`, `Deal2024.dat`, `DealData.dat`

### File Format

- **Delimiter**: Pipe (`|`) separated
- **Encoding**: UTF-8
- **Headers**: First row must contain column names

### Required Columns (Pitchbook Schema)

The `Deal*.dat` files must include at least these columns:

| Column Name | Description | Example |
|------------|-------------|---------|
| `CompanyID` | Unique company identifier | `100001-08` |
| `CompanyName` | Company name | `Anthropic` |
| `DealDate` | Date of deal | `2024-03-15` |
| `DealType` | Type of financing | `Early Stage VC` |
| `VCRound` | Round label | `Series A`, `Series B` |
| `DealSize` | Amount raised (USD) | `450000000` |
| `DealStatus` | Deal completion status | `Completed` |
| `Investors` | List of investors | `Menlo Ventures, Spark Capital` |
| `PostValuation` | Post-money valuation | `4500000000` |

### Sample Deal Data Header

```
CompanyID|CompanyName|DealNo|DealID|DealDate|DealSize|DealStatus|DealType|DealType2|DealType3|DealClass|VCRound|Investors|PostValuation|...
```

## How to Obtain Deal Data

1. **From Pitchbook Platform**:
   - Log in to Pitchbook
   - Navigate to Data Exports
   - Select "Deal Data" export
   - Choose pipe-delimited format
   - Include columns listed above
   - Export as `.dat` file

2. **Date Range Recommendations**:
   - **Series A**: 2021-01-01 to 2022-10-31
   - **Series B**: 2023-05-01 to 2025-10-31
   - (Pipeline will filter by these dates)

3. **Save Location**:
   - Place downloaded files in: `empirics/data/raw/`
   - Ensure filename starts with `Deal`

## What Happens Without Deal Data?

If `Deal*.dat` files are missing:

- ❌ Step 2 (Process Deal Data) creates empty deal panel
- ❌ Step 3 (Create Panel) produces empty analysis panel
- ❌ Step 4 (Run Analysis) cannot run (no observations)
- ❌ Step 5 (Create Deliverables) cannot generate figures

**The pipeline will complete but produce no results.**

## Verification

After adding `Deal*.dat` files to `data/raw/`, verify by running:

```bash
# Check files
ls -lh empirics/data/raw/Deal*.dat

# Run pipeline from step 2
python pipeline_xarray.py --from 2

# Or reset and run full pipeline
python pipeline_xarray.py --force
```

Expected output if successful:

```
[Step 2] Processing deal data...
  - Deal2023.dat: 150000 rows, 95 columns
✅ Processed 8500 deals
```

## Alternative: Split Combined Files

If you have a combined `Company2021_deal2023.dat` file:

1. **Identify the data structure**:
   - Check if file contains deal records or just company records
   - Look for columns like `DealDate`, `DealSize`, `VCRound`

2. **Extract Deal Data**:

   ```python
   import pandas as pd

   # Read combined file
   df = pd.read_csv('data/raw/Company2021_deal2023.dat', sep='|')

   # Filter to deal-specific columns and save
   deal_cols = ['CompanyID', 'CompanyName', 'DealDate', 'DealSize',
                'DealType', 'VCRound', 'DealStatus', 'Investors', 'PostValuation']
   deal_df = df[deal_cols].dropna(subset=['DealDate'])
   deal_df.to_csv('data/raw/Deal2023.dat', sep='|', index=False)
   ```

3. **Verify the split worked**:

   ```bash
   python pipeline_xarray.py --from 2
   ```

## Troubleshooting

### "No Deal*.dat files found"

→ Ensure files are in `empirics/data/raw/` and start with `Deal`

### "Processed 0 deals"

→ Check that:

- File has correct pipe-delimited format
- Required columns are present (especially `DealDate`, `DealType`, `VCRound`)
- Data includes VC deals (not just M&A or other deal types)

### "KeyError: 'CompanyID' not in index"

→ Deal file is missing required columns. Check schema above.

## Contact

For data access issues:

- Check Pitchbook subscription status
- Request access to Deal data exports
- Contact data provider for proper export format

---

**Last Updated**: 2025-10-26

**Pipeline Version**: 3.1_modular_xarray_lightweight
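
The checks listed under Deal Data Requirements and Troubleshooting (file name pattern, pipe delimiter, header row, required columns) can be run before starting the pipeline. The following is a minimal sketch using only the Python standard library. The function names `find_deal_files`, `missing_columns`, and `preflight` are hypothetical helpers introduced here for illustration; they are not part of `pipeline_xarray.py`, and the pipeline's own validation may differ.

```python
import csv
from pathlib import Path

# Column names copied from the "Required Columns" table in this document.
REQUIRED_COLUMNS = [
    "CompanyID", "CompanyName", "DealDate", "DealType", "VCRound",
    "DealSize", "DealStatus", "Investors", "PostValuation",
]


def find_deal_files(raw_dir):
    """Return the sorted paths in raw_dir whose names match Deal*.dat."""
    return sorted(Path(raw_dir).glob("Deal*.dat"))


def missing_columns(path, required=REQUIRED_COLUMNS):
    """Read the pipe-delimited header row and list required columns it lacks."""
    # "utf-8-sig" reads plain UTF-8 and also drops a leading byte-order mark.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle, delimiter="|"), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


def preflight(raw_dir):
    """Map each Deal*.dat file name to its list of missing required columns."""
    return {path.name: missing_columns(path) for path in find_deal_files(raw_dir)}


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains empirics/.
    report = preflight("empirics/data/raw")
    if not report:
        print("No Deal*.dat files found")
    for name, missing in report.items():
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

An empty report corresponds to the "No Deal*.dat files found" case, and a non-empty list of missing columns for a file points to the `KeyError` case described in Troubleshooting. The sketch only inspects the header row, so it will not detect files that contain no VC deals.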