# File Structure Explained - Simple Guide

## 🎯 WHAT YOU ACTUALLY NEED (Only 1 file!)

```
run_h2_seriesb.py          ⭐ THE MAIN FILE - RUN THIS ONE
```

**That's it!** This is the only file you need to run for your presentation. It automatically uses all the other files in the `code/` directory.

---

## 📂 Full File Structure

```
empirics/
│
├── 🟢 MAIN FILE (RUN THIS)
│   └── run_h2_seriesb.py                   ⭐ NEW 4-snapshot H2 analysis
│
├── 🔵 MODULE FILES (Used automatically by main file)
│   └── code/hypothesis_testing_pipeline/src/
│       ├── feature_engineering.py          Core: vagueness, IC, preprocessing
│       ├── hypothesis_tests.py             Core: H1 and H2 statistical models
│       └── visualizations.py               Optional: plots
│
├── 🟡 DIAGNOSTIC FILES (Optional debugging tools)
│   ├── diagnose_snapshots.py               Check data files exist
│   ├── diagnose_series_a.py                Debug Series A detection
│   ├── check_dealtype.py                   Check deal type patterns
│   ├── check_founder_columns.py            Check founder data
│   └── explore_h2_data.py                  EDA visualizations
│
├── 🔴 OLD FILES (Deprecated - ignore these)
│   ├── code/hypothesis_testing_pipeline/run_pipeline.py    ❌ OLD 2-snapshot version
│   ├── code/hypothesis_testing_pipeline/src/01-05_*.py     ❌ OLD modular approach
│   ├── code/pipeline_xarray.py                             ❌ OLD prototype
│   └── code/xarray_quick_start.py                          ❌ OLD demo
│
└── 📄 DOCUMENTATION
    ├── SINGULAR_MATRIX_FIXES.md            Today's fixes (read this!)
    ├── LLM2_IMPLEMENTATION_SUMMARY.md      How 4-snapshot approach works
    ├── METHODOLOGICAL_CHOICES.md           Key decisions explained
    └── DATA_REQUIREMENTS.md                What data files you need
```

---

## 📝 Detailed Explanation

### 🟢 MAIN FILE

#### `run_h2_seriesb.py`

**Purpose**: Complete H2 analysis pipeline from raw data to results

**What it does**:

1. Loads 4 snapshots (Dec 2021, Jan 2022, May 2022, May 2023)
2. Creates Series B+ progression DV
3. Engineers features (vagueness, IC, etc.)
4. Runs H2 hypothesis tests (4 variants)
5. Saves results to CSV

**How to run**:

```bash
python run_h2_seriesb.py --output outputs/
```

**Outputs** (saved to `outputs/`):

- `h2_main_coefficients.csv` - Primary results
- `h2_robustness_sector_fe.csv` - With sector FE
- `h2_robustness_MA_upper.csv` - M&A upper bound
- `h2_robustness_MA_lower.csv` - M&A lower bound
- `h2_analysis_dataset_17m.csv` - Full dataset

---

### 🔵 MODULE FILES (Auto-loaded by main file)

#### `code/hypothesis_testing_pipeline/src/feature_engineering.py`

**Purpose**: Transform raw data into analysis variables

**Key functions**:

- `compute_vagueness()` - Text analysis of company descriptions
- `classify_integration_cost()` - Sector → modular/integrated
- `create_survival_seriesb_progression()` - 4-snapshot DV construction
- `preprocess_for_h2()` - **NEW**: Fixes singular matrix issues

**You don't call this directly** - `run_h2_seriesb.py` calls it automatically

---

#### `code/hypothesis_testing_pipeline/src/hypothesis_tests.py`

**Purpose**: Statistical models for H1 and H2

**Key functions**:

- `test_h1_early_funding()` - OLS: Early Funding ~ Vagueness
- `test_h2_main_survival()` - Logit: Survival ~ Vagueness × IC
- `test_h2_robustness_sector_fe()` - **NEW**: Robustness with sector FE

**You don't call this directly** - `run_h2_seriesb.py` calls it automatically

---

#### `code/hypothesis_testing_pipeline/src/visualizations.py`

**Purpose**: Optional plots (scatter, interaction, ROC curves)

**You can ignore this** - Visualizations are optional

---

### 🟡 DIAGNOSTIC FILES (Run these only if debugging)

#### `diagnose_snapshots.py`

**When to use**: If you get "file not found" errors

**What it does**: Checks if all 4 data files exist and can be read

```bash
python diagnose_snapshots.py
```

---

#### `diagnose_series_a.py`

**When to use**: If you get "At Series A: 0" error

**What it does**: Shows how many companies match Series A patterns

```bash
python diagnose_series_a.py
```

---

#### `check_dealtype.py`

**When to use**: If Series A/B detection seems wrong

**What it does**: Shows all unique deal type values in data

```bash
python check_dealtype.py
```

---

#### `check_founder_columns.py`

**When to use**: If founder credibility is 0%

**What it does**: Shows founder-related columns and coverage

```bash
python check_founder_columns.py
```

---

#### `explore_h2_data.py`

**When to use**: If you want to see data distributions before running models

**What it does**: Creates EDA plots (Bayesian workflow style)

```bash
python explore_h2_data.py
```

---

### 🔴 OLD/DEPRECATED FILES (Don't use these)

#### ❌ `code/hypothesis_testing_pipeline/run_pipeline.py`

**Why deprecated**: Uses old 2-snapshot approach (98% survival → singular matrix)

**Replaced by**: `run_h2_seriesb.py` (4-snapshot, 12-15% progression)

#### ❌ `code/hypothesis_testing_pipeline/src/01_process_company_data.py`
#### ❌ `code/hypothesis_testing_pipeline/src/02_process_deal_data.py`
#### ❌ `code/hypothesis_testing_pipeline/src/03_create_panel.py`
#### ❌ `code/hypothesis_testing_pipeline/src/04_run_analysis.py`
#### ❌ `code/hypothesis_testing_pipeline/src/05_create_deliverables.py`

**Why deprecated**: Old modular approach that required running 5 separate scripts

**Replaced by**: Single `run_h2_seriesb.py` that does everything

#### ❌ `code/pipeline_xarray.py`
#### ❌ `code/xarray_quick_start.py`

**Why deprecated**: Early prototypes using xarray (overly complex)

**Replaced by**: Simpler pandas-based approach in `run_h2_seriesb.py`

---

## 🚀 Simplified Workflow

### For your presentation TODAY:

**Step 1**: Make sure data files exist

```bash
ls -lh data/raw/*.dat
# Should see: Company20211201.dat, Company20220101.dat,
#             Company20220501.dat, Company20230501.dat
```

**Step 2**: Run the main file

```bash
python run_h2_seriesb.py --output outputs/
```

**Step 3**: Check results

```bash
ls -lh outputs/
# Should see: h2_main_coefficients.csv (and 3 other robustness files)
```

**That's it!** Three steps, one command.

---

## 📊 What Each Output File Contains

### `h2_main_coefficients.csv` (PRIMARY - use this for presentation)

```
variable                              coefficient    p_value
z_vagueness                           0.XXXX         0.XXX
z_vagueness:high_integration_cost     -0.XXXX        0.XXX
z_employees_log                       0.XXXX         0.XXX
C(founding_cohort)[T.2010-14]         0.XXXX         0.XXX
...
```

**Key coefficients**:

- `z_vagueness`: Main effect (β₁) - effect in modular sectors
- `z_vagueness:high_integration_cost`: Interaction (β₃) - differential in integrated

---

### `h2_robustness_sector_fe.csv` (ROBUSTNESS 1)

Same as primary but includes sector fixed effects using `ic_within`

---

### `h2_robustness_MA_upper.csv` (ROBUSTNESS 2)

M&A=1 (treats acquisitions as survival success - upper bound)

---

### `h2_robustness_MA_lower.csv` (ROBUSTNESS 3)

M&A=0 (treats acquisitions as failure - lower bound)

---

## 🎯 Quick Reference Card

```
┌─────────────────────────────────────────────┐
│ WHAT TO RUN FOR PRESENTATION                │
├─────────────────────────────────────────────┤
│ python run_h2_seriesb.py --output outputs/  │
│                                             │
│ WHAT TO PRESENT:                            │
│ - outputs/h2_main_coefficients.csv          │
│                                             │
│ WHAT TO MENTION:                            │
│ - "3 robustness checks confirm results"     │
│                                             │
│ IF SOMETHING BREAKS:                        │
│ - python diagnose_snapshots.py              │
│ - python diagnose_series_a.py               │
└─────────────────────────────────────────────┘
```

---

## 💡 Simplification Recommendation

**Current state**: Too many files (confusing)

**Should we simplify to**:

```
empirics/
├── run_analysis.py          ⭐ Renamed from run_h2_seriesb.py (clearer name)
├── modules/
│   ├── features.py          (renamed from feature_engineering.py)
│   ├── models.py            (renamed from hypothesis_tests.py)
│   └── plots.py             (renamed from visualizations.py)
└── diagnostics/
    ├── check_data.py        (combines all diagnostic scripts)
    └── explore.py           (EDA script)
```

**Would this be clearer?** Let me know if you want me to implement this simplification!
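
To make the proposal more concrete, here is a minimal sketch of how a combined `diagnostics/check_data.py` might begin. It is hypothetical and not taken from the existing scripts. The helper name `missing_snapshots` is invented for illustration. It covers only the file-existence check that `diagnose_snapshots.py` is described as performing, and it assumes the four snapshot file names from Step 1 sit under `data/raw/`.

```python
from pathlib import Path

# Snapshot file names as listed in Step 1 of the workflow.
SNAPSHOT_FILES = (
    "Company20211201.dat",
    "Company20220101.dat",
    "Company20220501.dat",
    "Company20230501.dat",
)


def missing_snapshots(data_dir, names=SNAPSHOT_FILES):
    """Return the expected snapshot files that are absent or empty.

    Hypothetical helper for illustration; not part of the current code base.
    """
    root = Path(data_dir)
    missing = []
    for name in names:
        path = root / name
        # Treat a zero-byte file as missing, since it cannot be loaded.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_snapshots("data/raw")  # assumed data location
    if absent:
        print("Missing or empty snapshot files:", ", ".join(absent))
    else:
        print("All 4 snapshot files found.")
```

The Series A, deal type and founder column checks could be added as further functions in the same module, each returning a plain value so they stay easy to test.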