# Simple Workflow - What Actually Happens

## 🎬 One Command Does Everything

```bash
python run_h2_seriesb.py --output outputs/
```

---

## 📊 What Happens Behind the Scenes

```
┌─────────────────────────────────────────────────────────────────┐
│                        run_h2_seriesb.py                        │
│                         (YOU RUN THIS)                          │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │  STEP 1: Load 4 Snapshots     │
         │  - Dec 2021 (baseline)        │
         │  - Jan 2022 (mid 1)           │
         │  - May 2022 (mid 2)           │
         │  - May 2023 (endpoint)        │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │  STEP 2: Create DV            │
         │  (feature_engineering.py)     │
         │                               │
         │  Company progression:         │
         │    A→B+ = 1 (success)         │
         │    A→A = 0 (stayed)           │
         │    A→M&A = censored           │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │  STEP 3: Create IVs           │
         │  (feature_engineering.py)     │
         │                               │
         │  - vagueness (text analysis)  │
         │  - high_integration_cost      │
         │  - employees_log              │
         │  - sector_fe                  │
         │  - year_founded               │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │  STEP 4: Preprocessing        │
         │  (feature_engineering.py)     │
         │  preprocess_for_h2()          │
         │                               │
         │  Creates:                     │
         │  - founding_cohort            │
         │  - z_vagueness                │
         │  - z_employees_log            │
         │  - ic_within                  │
         │  Drops: founder_credibility   │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │  STEP 5: Run H2 Models        │
         │  (hypothesis_tests.py)        │
         │                               │
         │  1. Primary (no sector FE)    │
         │  2. Robustness (sector FE)    │
         │  3. M&A upper bound           │
         │  4. M&A lower bound           │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │  STEP 6: Save Results         │
         │                               │
         │  outputs/                     │
         │  ├─ h2_main_coefficients.csv  │
         │  ├─ h2_robustness_*.csv       │
         │  └─ h2_analysis_dataset.csv   │
         └───────────────────────────────┘
```

---

## 🧩 How Files Connect

```
YOU
 │
 └─► run_h2_seriesb.py
      │
      ├─► imports feature_engineering.py
      │    │
      │    ├─► compute_vagueness()
      │    ├─► classify_integration_cost()
      │    ├─► create_survival_seriesb_progression()  ← 4 snapshots
      │    └─► preprocess_for_h2()  ← NEW! Fixes singular matrix
      │
      └─► imports hypothesis_tests.py
           │
           ├─► test_h2_main_survival()  ← Primary model
           └─► test_h2_robustness_sector_fe()  ← NEW! Robustness
```

**You never run the module files directly** - `run_h2_seriesb.py` imports them automatically!

---

## 🎯 Analogy: Making Coffee

**Your situation is like**:

```
BAD (confusing):
- grinder.py
- filter.py
- brewer.py
- heater.py
- old_grinder.py (deprecated)
- old_brewer.py (deprecated)
- check_beans.py (diagnostic)
- check_water.py (diagnostic)
"Wait, which file do I run to make coffee??"

GOOD (what we have now):
- make_coffee.py ⭐ RUN THIS
  (automatically uses grinder, filter, brewer, heater)
- check_equipment.py (only if something breaks)
```

---

## 📋 When to Use Each File

### Daily Use

```bash
python run_h2_seriesb.py --output outputs/  # Only this!
```

### If Errors Occur

```bash
python diagnose_snapshots.py   # "File not found" errors
python diagnose_series_a.py    # "At Series A: 0" errors
```

### If You Want EDA

```bash
python explore_h2_data.py      # See distributions before modeling
```

### Never Use

- ❌ run_pipeline.py (old)
- ❌ 01_process_*.py through 05_*.py (old modular approach)
- ❌ pipeline_xarray.py (prototype)

---

## 🤔 Why So Many Files Exist?

**History**:

1. **Phase 1** (old): Modular approach with 5 separate scripts
2. **Phase 2** (old): Single `run_pipeline.py` with 2 snapshots
3. **Phase 3** (current): Single `run_h2_seriesb.py` with 4 snapshots ✓

We kept the old files for reference, but **you only need Phase 3**.

---

## ✂️ Should We Delete Old Files?

**Proposal**: Move to archive?

```bash
mkdir -p archive_old_code
mv code/hypothesis_testing_pipeline/run_pipeline.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/01_*.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/02_*.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/03_*.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/04_*.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/05_*.py archive_old_code/
mv code/pipeline_xarray.py archive_old_code/
mv code/xarray_quick_start.py archive_old_code/
```

**After cleanup, you'd have**:

```
empirics/
├── run_h2_seriesb.py                    ⭐ MAIN FILE
├── code/hypothesis_testing_pipeline/src/
│   ├── feature_engineering.py           ✓ Active module
│   ├── hypothesis_tests.py              ✓ Active module
│   └── visualizations.py                ✓ Active module
├── diagnose_snapshots.py                ✓ Diagnostic tool
├── diagnose_series_a.py                 ✓ Diagnostic tool
└── explore_h2_data.py                   ✓ EDA tool
```

Much cleaner! **Want me to do this?**

---

## 📝 Summary for Your Presentation

**When Prof. Fine asks "How did you run the analysis?"**

Say:

> "I ran a single Python script that:
> 1. Loads 4 PitchBook snapshots
> 2. Constructs the Series B+ progression outcome
> 3. Engineers vagueness and integration cost features
> 4. Runs logistic regression with proper preprocessing
> 5. Outputs 4 result files: primary + 3 robustness checks
>
> The code is fully reproducible with one command."

**Don't mention**:

- The old files that don't work
- The complexity of the underlying modules
- The fact that we had to fix singular matrix issues

**Do mention**:

- "Longitudinal 4-snapshot design"
- "12-15% base rate (proper variation)"
- "Multiple robustness checks"
- "Z-score standardization for numerical stability"

---

## 🎓 Teaching Version

**If you were teaching this to a student**:

```bash
# This is ALL you need to know:

# 1. Run the analysis
python run_h2_seriesb.py --output outputs/

# 2. Check the main results
cat outputs/h2_main_coefficients.csv

# 3. Interpret
# β₁ (z_vagueness) = effect in modular sectors
# β₃ (interaction) = differential in integrated sectors
# If β₁ > 0 and p < 0.05: H2 supported!
```

---

## 💬 Questions?

**Q**: "Why can't I just have ONE file with everything?"
**A**: We could! But separating feature engineering from statistical models makes debugging easier. The module files are like a library - you import them but don't run them directly.

**Q**: "Do I need to understand feature_engineering.py?"
**A**: No! Just trust it works. If you're curious, read the docstrings.

**Q**: "What if I want to change the H2 formula?"
**A**: Edit `hypothesis_tests.py` line 109 (the formula string), but you probably don't need to!

**Q**: "Can I just delete all the old files?"
**A**: Yes! Let me know and I'll move them to archive.
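
---

**If you're curious what the Step 2 outcome coding looks like in code**, here is a minimal sketch of the rule from the diagram (A→B+ = 1, A→A = 0, A→M&A = censored). It is not the actual body of `create_survival_seriesb_progression()`: the function name `code_progression`, the stage labels, and the `mna_as` parameter are all hypothetical. In particular, recoding M&A exits as 1 or 0 is only my guess at how the two M&A bound models in Step 5 might work, so check `hypothesis_tests.py` for the real rule.

```python
from typing import Optional

# Hypothetical stage labels: the real PitchBook snapshot values may differ.
SERIES_A = "Series A"
LATER_ROUNDS = {"Series B", "Series C", "Series D"}  # stands in for "B+"
MNA = "M&A"


def code_progression(baseline: str, endpoint: str,
                     mna_as: Optional[int] = None) -> Optional[int]:
    """Code the Step 2 outcome for one company.

    Returns 1 (A→B+), 0 (A→A), or None (censored / outside the sample).
    mna_as is an ASSUMPTION about the M&A bound models: pass 1 to count
    M&A exits as successes, 0 to count them as non-progression.
    """
    if baseline != SERIES_A:
        return None          # not at Series A in the baseline snapshot
    if endpoint in LATER_ROUNDS:
        return 1             # A→B+ = 1 (success)
    if endpoint == SERIES_A:
        return 0             # A→A = 0 (stayed)
    if endpoint == MNA:
        return mna_as        # None keeps it censored (primary model)
    return None              # any other endpoint: not specified in this doc


# Made-up companies, for illustration only.
companies = [
    ("Acme", "Series A", "Series B"),
    ("Bolt", "Series A", "Series A"),
    ("Cato", "Series A", "M&A"),
]

for label, mna_as in [("primary", None), ("M&A as 1", 1), ("M&A as 0", 0)]:
    coded = {name: code_progression(b, e, mna_as) for name, b, e in companies}
    print(label, coded)
```

Keeping the M&A treatment as a single argument means the primary model and both bound models can share one coding function, which is one simple way to keep the three outcome definitions from drifting apart.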