# Simple Workflow - What Actually Happens
## One Command Does Everything
```bash
python run_h2_seriesb.py --output outputs/
```
---
## What Happens Behind the Scenes
```
┌─────────────────────────────────────────────────────────────────┐
│                        run_h2_seriesb.py                        │
│                         (YOU RUN THIS)                          │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │ STEP 1: Load 4 Snapshots      │
         │   - Dec 2021 (baseline)       │
         │   - Jan 2022 (mid 1)          │
         │   - May 2022 (mid 2)          │
         │   - May 2023 (endpoint)       │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │ STEP 2: Create DV             │
         │ (feature_engineering.py)      │
         │                               │
         │ Company progression:          │
         │   A→B+  = 1 (success)         │
         │   A→A   = 0 (stayed)          │
         │   A→M&A = censored            │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │ STEP 3: Create IVs            │
         │ (feature_engineering.py)      │
         │                               │
         │ - vagueness (text analysis)   │
         │ - high_integration_cost       │
         │ - employees_log               │
         │ - sector_fe                   │
         │ - year_founded                │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │ STEP 4: Preprocessing         │
         │ (feature_engineering.py)      │
         │ preprocess_for_h2()           │
         │                               │
         │ Creates:                      │
         │   - founding_cohort           │
         │   - z_vagueness               │
         │   - z_employees_log           │
         │   - ic_within                 │
         │ Drops: founder_credibility    │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │ STEP 5: Run H2 Models         │
         │ (hypothesis_tests.py)         │
         │                               │
         │ 1. Primary (no sector FE)     │
         │ 2. Robustness (sector FE)     │
         │ 3. M&A upper bound            │
         │ 4. M&A lower bound            │
         └───────────────┬───────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │ STEP 6: Save Results          │
         │                               │
         │ outputs/                      │
         │ ├─ h2_main_coefficients.csv   │
         │ ├─ h2_robustness_*.csv        │
         │ └─ h2_analysis_dataset.csv    │
         └───────────────────────────────┘
```
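
The outcome coding in Step 2, together with the M&A bounds listed in Step 5, can be sketched as follows. This is a minimal illustration, not the code in `feature_engineering.py`: the function name `code_progression`, the stage labels, and the reading of the bounds (M&A exits counted as successes for the upper bound and as non-successes for the lower bound) are all assumptions.

```python
from typing import Optional

# Hypothetical stage labels; the real snapshot data may use different ones.
LATER_ROUNDS = {"Series B", "Series C", "Series D"}


def code_progression(endpoint_stage: str, mna_rule: str = "censor") -> Optional[int]:
    """Code the outcome for a company that was at Series A at baseline.

    Returns 1 (reached Series B+), 0 (stayed at Series A), or None (censored).
    mna_rule controls M&A exits: "censor", "success" (assumed upper bound),
    or "failure" (assumed lower bound).
    """
    if endpoint_stage in LATER_ROUNDS:
        return 1
    if endpoint_stage == "Series A":
        return 0
    if endpoint_stage == "M&A":
        return {"censor": None, "success": 1, "failure": 0}[mna_rule]
    return None  # Unknown stage: treated as censored in this sketch.


endpoints = ["Series B", "Series A", "M&A"]
primary = [code_progression(s) for s in endpoints]
upper = [code_progression(s, "success") for s in endpoints]
lower = [code_progression(s, "failure") for s in endpoints]
print(primary, upper, lower)
```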
---
## How Files Connect
```
YOU
 │
 └─► run_h2_seriesb.py
      │
      ├─► imports feature_engineering.py
      │    │
      │    ├─► compute_vagueness()
      │    ├─► classify_integration_cost()
      │    ├─► create_survival_seriesb_progression()  ← 4 snapshots
      │    └─► preprocess_for_h2()                    ← NEW! Fixes singular matrix
      │
      └─► imports hypothesis_tests.py
           │
           ├─► test_h2_main_survival()                ← Primary model
           └─► test_h2_robustness_sector_fe()         ← NEW! Robustness
```
**You never run the module files directly.** `run_h2_seriesb.py` imports them automatically.

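
The wiring above can be pictured with a minimal sketch of an entry-point script: it parses `--output`, calls a feature step, then calls a model step. The two helper functions are hypothetical stand-ins defined inline so the sketch runs on its own; the real `run_h2_seriesb.py` imports its functions from the modules and may be organized differently.

```python
import argparse
from pathlib import Path


# Hypothetical stand-ins for functions imported from feature_engineering.py
# and hypothesis_tests.py; the real names and signatures may differ.
def build_features(raw_rows):
    """Pretend feature step: drop rows with a missing outcome."""
    return [row for row in raw_rows if row is not None]


def fit_primary_model(rows):
    """Pretend model step: report only the number of usable observations."""
    return {"n_obs": len(rows)}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of a one-command pipeline.")
    parser.add_argument("--output", type=Path, default=Path("outputs"))
    args = parser.parse_args(argv)

    rows = build_features([1, None, 0])  # feature engineering step
    result = fit_primary_model(rows)     # hypothesis test step
    return args.output, result


# Equivalent to: python run_h2_seriesb.py --output outputs/
out_dir, result = main(["--output", "outputs/"])
print(out_dir, result)
```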
---
## Analogy: Making Coffee
**Your situation is like this**:
```
BAD (confusing):
- grinder.py
- filter.py
- brewer.py
- heater.py
- old_grinder.py (deprecated)
- old_brewer.py (deprecated)
- check_beans.py (diagnostic)
- check_water.py (diagnostic)
"Wait, which file do I run to make coffee??"

GOOD (what we have now):
- make_coffee.py  ← RUN THIS
  (automatically uses grinder, filter, brewer, heater)
- check_equipment.py (only if something breaks)
```
---
## When to Use Each File
### Daily Use
```bash
python run_h2_seriesb.py --output outputs/ # Only this!
```
### If Errors Occur
```bash
python diagnose_snapshots.py # "File not found" errors
python diagnose_series_a.py # "At Series A: 0" errors
```
### If You Want EDA
```bash
python explore_h2_data.py # See distributions before modeling
```
### Never Use
- `run_pipeline.py` (old)
- `01_process_*.py` through `05_*.py` (old modular approach)
- `pipeline_xarray.py` (prototype)
---
## Why So Many Files Exist?
**History**:
1. **Phase 1** (old): Modular approach with 5 separate scripts
2. **Phase 2** (old): Single `run_pipeline.py` with 2 snapshots
3. **Phase 3** (current): Single `run_h2_seriesb.py` with 4 snapshots

We kept the old files for reference, but **you only need Phase 3**.

---
## Should We Delete Old Files?
**Proposal**: Move to archive?
```bash
mkdir archive_old_code
mv code/hypothesis_testing_pipeline/run_pipeline.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/01_*.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/02_*.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/03_*.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/04_*.py archive_old_code/
mv code/hypothesis_testing_pipeline/src/05_*.py archive_old_code/
mv code/pipeline_xarray.py archive_old_code/
mv code/xarray_quick_start.py archive_old_code/
```
**After cleanup, you'd have**:
```
empirics/
├── run_h2_seriesb.py                      ← MAIN FILE
├── code/hypothesis_testing_pipeline/src/
│   ├── feature_engineering.py             ← Active module
│   ├── hypothesis_tests.py                ← Active module
│   └── visualizations.py                  ← Active module
├── diagnose_snapshots.py                  ← Diagnostic tool
├── diagnose_series_a.py                   ← Diagnostic tool
└── explore_h2_data.py                     ← EDA tool
```
Much cleaner! **Want me to do this?**
---
## Summary for Your Presentation
**When Prof. Fine asks "How did you run the analysis?"**
Say:
> "I ran a single Python script that:
> 1. Loads 4 PitchBook snapshots
> 2. Constructs Series B+ progression outcome
> 3. Engineers vagueness and integration cost features
> 4. Runs logistic regression with proper preprocessing
> 5. Outputs 4 result files: primary + 3 robustness checks
>
> The code is fully reproducible with one command."
**Don't mention**:
- The old files that don't work
- The complexity of the underlying modules
- The fact that we had to fix singular matrix issues
**Do mention**:
- "Longitudinal 4-snapshot design"
- "12-15% base rate (proper variation)"
- "Multiple robustness checks"
- "Z-score standardization for numerical stability"
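
If asked what z-score standardization means: it rescales a variable to mean 0 and standard deviation 1 so that predictors sit on comparable scales. A minimal sketch using only the standard library and made-up values follows. The `z_score` helper is hypothetical, and `preprocess_for_h2()` may implement this differently, for example with the sample rather than the population standard deviation. The sketch rejects a constant column, which cannot be standardized and is also collinear with a model intercept.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("Cannot standardize a constant column.")
    return [(v - mu) / sigma for v in values]


employees_log = [2.0, 4.0, 6.0]  # illustrative values, not real data
z_employees_log = z_score(employees_log)
print(z_employees_log)
```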
---
## Teaching Version
**If you were teaching this to a student**:
```bash
# This is ALL you need to know:

# 1. Run the analysis
python run_h2_seriesb.py --output outputs/

# 2. Check the main results
cat outputs/h2_main_coefficients.csv

# 3. Interpret
#    β on z_vagueness          = effect in modular sectors
#    β on the interaction term = differential in integrated sectors
#    If the interaction β > 0 and p < 0.05: H2 supported!
```
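
The check-and-interpret steps can also be done programmatically. The sketch below reads a coefficients table with the standard `csv` module and applies the decision rule, assuming the rule refers to the interaction coefficient. The column names (`term`, `coef`, `p_value`), the interaction label, and the sample numbers are all made up for illustration; check the header of `h2_main_coefficients.csv` for the actual names.

```python
import csv
import io

# Made-up stand-in for outputs/h2_main_coefficients.csv (hypothetical columns).
SAMPLE_CSV = """term,coef,p_value
z_vagueness,-0.10,0.300
high_integration_cost,0.05,0.600
z_vagueness:high_integration_cost,0.25,0.020
"""


def h2_supported(csv_file, interaction_term, alpha=0.05):
    """Return True if the interaction coefficient is positive and significant."""
    for row in csv.DictReader(csv_file):
        if row["term"] == interaction_term:
            return float(row["coef"]) > 0 and float(row["p_value"]) < alpha
    raise KeyError(f"Term not found: {interaction_term}")


supported = h2_supported(
    io.StringIO(SAMPLE_CSV), "z_vagueness:high_integration_cost"
)
print(supported)
```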
---
## Questions?
**Q**: "Why can't I just have ONE file with everything?"
**A**: We could! But separating feature engineering from statistical models makes debugging easier. The module files are like a library - you import them but don't run them directly.
**Q**: "Do I need to understand feature_engineering.py?"
**A**: No! Just trust it works. If you're curious, read the docstrings.
**Q**: "What if I want to change the H2 formula?"
**A**: Edit `hypothesis_tests.py` line 109 (the formula string), but you probably don't need to!
**Q**: "Can I just delete all the old files?"
**A**: Yes! Let me know and I'll move them to archive.