# Diagnostic Tests for E/L/S Analysis
This folder contains critical diagnostic tests for validating data quality and methodology assumptions in the entrepreneurship strategy analysis.
## Test Scripts
### 1. **verify_EL_definition.py** - E/L Definition Validation
**Purpose:** Checks whether the current E (Early funding) and L (Later success) definitions correctly capture the intended populations.
**Research Question:** Does the "narrow" definition (E=1 only if LastFinancing = Series A at baseline) exclude successful "fast-growing" companies that had already progressed to Series B+ before baseline?
**What it Tests:**
- Compares **narrow** vs **broad** E/L definitions
  - Narrow: E=1 if LastFinancing = "Series A" at baseline (2022.01)
  - Broad: E=1 if LastFinancing ≥ "Series A" at baseline (includes B+)
- Identifies companies excluded by narrow definition
- Calculates L (later success) rates for each group
- Detects logically impossible cases (E=0 & L=1)
**Key Outputs:**
```
E (narrow): N companies
E (broad): N+X companies
Excluded: X companies (these got A→B+ before baseline)
L rate for narrow E: Y%
L rate for excluded: Z%
```
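
A minimal sketch of how the counts and rates above could be produced with pandas. It reuses the two patterns from the Classification Logic section; the helper names (`classify_e`, `summarize`) and the toy data are illustrative, and `later_success` is assumed to be a precomputed 0/1 series because this README does not spell out how L is coded. The actual script may differ.

```python
import pandas as pd

# Patterns copied from the Classification Logic section of this README.
PAT_A = r"(?:\bEarly\s*Stage\s*VC\b|\bSeries\s*A(?:\b|[\s\-]?\d*)\b)"
PAT_Bp = r"(?:\bLater\s*Stage\s*VC\b|\bSeries\s*[B-G](?:\b|[\s\-]?\d*)\b)"


def classify_e(deal_type: pd.Series) -> pd.DataFrame:
    """Return narrow and broad E flags for baseline deal types (hypothetical helper)."""
    is_a = deal_type.str.contains(PAT_A, regex=True, na=False)
    is_bp = deal_type.str.contains(PAT_Bp, regex=True, na=False)
    return pd.DataFrame({"E_narrow": is_a, "E_broad": is_a | is_bp})


def summarize(flags: pd.DataFrame, later_success: pd.Series) -> dict:
    """Count each group and compute its L rate (hypothetical helper)."""
    excluded = flags["E_broad"] & ~flags["E_narrow"]
    return {
        "n_narrow": int(flags["E_narrow"].sum()),
        "n_broad": int(flags["E_broad"].sum()),
        "n_excluded": int(excluded.sum()),
        "l_rate_narrow": float(later_success[flags["E_narrow"]].mean()),
        "l_rate_excluded": float(later_success[excluded].mean()),
    }


# Toy data, invented for illustration only.
baseline_types = pd.Series(
    ["Series A", "Series B", "Seed Round", "Later Stage VC", "Early Stage VC", None]
)
later_success = pd.Series([1, 1, 0, 0, 0, 0])  # assumed precomputed L flag

result = summarize(classify_e(baseline_types), later_success)
print(result)
```
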
**Decision Criteria:**
- If X > 25,000 AND Z% > Y% → Switch to broad definition (features.py:466)
- If X < 10,000 OR Z% < Y% → Keep narrow definition, document limitation
- Critical case: if excluded companies turn out to be more successful (Z% > Y%), the narrow definition introduces selection bias
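
The thresholds above can be written as a small decision function. This is only a sketch: `choose_e_definition` is a hypothetical name, and the `"undetermined"` branch is an assumption covering inputs that the criteria as written do not address (for example, X between 10,000 and 25,000 with Z% > Y%).

```python
def choose_e_definition(
    n_excluded: int, l_rate_narrow: float, l_rate_excluded: float
) -> str:
    """Apply the README decision criteria (hypothetical helper).

    n_excluded      -> X, companies excluded by the narrow definition
    l_rate_narrow   -> Y, L rate for narrow E
    l_rate_excluded -> Z, L rate for excluded companies
    """
    if n_excluded > 25_000 and l_rate_excluded > l_rate_narrow:
        return "broad"
    if n_excluded < 10_000 or l_rate_excluded < l_rate_narrow:
        return "narrow"
    # Assumption: the criteria do not say what to do here.
    return "undetermined"


print(choose_e_definition(30_000, 0.04, 0.20))
print(choose_e_definition(15_000, 0.04, 0.20))
```
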
**Usage:**
```bash
python tests/verify_EL_definition.py
```
**Data Requirements:**
- `data/raw/Company20220101.dat` (baseline snapshot)
- `data/raw/Company20231201.dat` (endpoint snapshot)
- Pipe-delimited PitchBook format with `LastFinancingDealType` column
---
### 2. **test_consecutive_rounds.py** - Bridge Rounds Detection
**Purpose:** Tests whether later VC funding (L) was the VERY NEXT round after early funding (E), or if there were bridge rounds in between.
**Research Question:** Does the step-up calculation (S = PreMoney_t2 / PostMoney_t1) accurately measure A→B growth, or is it confounded by intermediate bridge rounds?
**What it Tests:**
- Identifies E→L transition companies (Series A at baseline → B+ at endpoint)
- Checks for round count columns (TotalFundingRounds, etc.)
- Calculates round_diff = rounds_t2 - rounds_t1
- Analyzes deal type transition patterns (A→B vs A→C vs A→B via bridge)
- Date-based timing analysis as fallback proxy
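
The round-count check can be sketched as follows, assuming the baseline and endpoint round counts are already aligned on the same company index. `consecutive_share` is a hypothetical helper and the numbers are invented; the real column name (for example `TotalFundingRounds`) depends on what the data actually contains.

```python
import pandas as pd


def consecutive_share(rounds_t1: pd.Series, rounds_t2: pd.Series) -> dict:
    """Share of companies with exactly one vs more than one round between snapshots."""
    round_diff = (rounds_t2 - rounds_t1).dropna()  # drop companies missing either count
    return {
        "n": int(len(round_diff)),
        "pct_consecutive": float((round_diff == 1).mean() * 100),
        "pct_bridge": float((round_diff > 1).mean() * 100),
    }


# Toy round counts, indexed by a made-up company ID.
rounds_t1 = pd.Series([1, 2, 1, 3], index=["c1", "c2", "c3", "c4"])
rounds_t2 = pd.Series([2, 4, 2, 6], index=["c1", "c2", "c3", "c4"])

shares = consecutive_share(rounds_t1, rounds_t2)
print(shares)
```
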
**Key Outputs:**
```
E→L transitions: N companies
Round count data available: Yes/No
IF round data available:
  Exactly 1 round difference (consecutive): X%
  >1 round difference (bridge rounds): Y%
Deal type transitions:
  A → B: N1 companies
  A → C: N2 companies
  A → A: N3 companies (concerning - stalled?)
```
**Decision Criteria:**
- If Y% > 40% (bridge rounds common) → Options:
  - **Censor**: Exclude non-consecutive from H3/HSF analysis
  - **Request data**: Get full deal-level history to properly code rounds
  - **Document**: Note as major limitation
- If Y% < 20% (mostly consecutive) → Use full E→L sample
**Usage:**
```bash
python tests/test_consecutive_rounds.py
```
**Data Requirements:**
- Same as verify_EL_definition.py
- Ideally also has round count columns (e.g., `TotalFundingRounds`)
---
## Classification Logic
Both scripts use **identical regex patterns** from `modules/features.py` to ensure consistency:
```python
PAT_A = r"(?:\bEarly\s*Stage\s*VC\b|\bSeries\s*A(?:\b|[\s\-]?\d*)\b)"
PAT_Bp = r"(?:\bLater\s*Stage\s*VC\b|\bSeries\s*[B-G](?:\b|[\s\-]?\d*)\b)"
```
**Matches for Series A (E):**
- "Early Stage VC"
- "Series A"
- "Series A-II" (second closing)
- "Series A - 1" (tranche)
**Matches for Series B+ (L):**
- "Later Stage VC"
- "Series B", "Series C", "Series D", "Series E", "Series F", "Series G"
- With variants: "Series B-II", "Series C - 1", etc.
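
A quick way to check the patterns against sample strings is shown below. The `classify` helper and the order in which the patterns are tried are assumptions for illustration; `modules/features.py` may apply them differently, for example with regex flags that this sketch does not use.

```python
import re

PAT_A = r"(?:\bEarly\s*Stage\s*VC\b|\bSeries\s*A(?:\b|[\s\-]?\d*)\b)"
PAT_Bp = r"(?:\bLater\s*Stage\s*VC\b|\bSeries\s*[B-G](?:\b|[\s\-]?\d*)\b)"


def classify(deal_type: str) -> str:
    """Label a deal type string as 'A', 'B+', or 'other' (hypothetical helper)."""
    if re.search(PAT_Bp, deal_type):
        return "B+"
    if re.search(PAT_A, deal_type):
        return "A"
    return "other"


samples = [
    "Early Stage VC",
    "Series A",
    "Series A-II",
    "Series A - 1",
    "Later Stage VC",
    "Series B",
    "Series C - 1",
    "Seed Round",
    "Series H",
]
labels = {s: classify(s) for s in samples}
for name, label in labels.items():
    print(f"{name!r}: {label}")
```

Without `re.IGNORECASE` the patterns are case-sensitive, so a lowercase string such as "series a" falls through to "other" in this sketch.
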
---
## Relationship to Research Design
### Theoretical Model (W2 Slides)
```
V (Vagueness) → E (Early funding) → L (Later success) → S (Step-up)
                ↓                   ↑
                └─────(mediator)────┘
```
**Why These Tests Matter:**
1. **verify_EL_definition.py:** Tests whether E correctly captures "received early VC"
   - Affects HEV (H1): E ~ V + controls
   - Affects HLVF (H2): L ~ V × F (NO E control - mediator principle)
   - Wrong E definition → biased coefficient estimates
2. **test_consecutive_rounds.py:** Tests whether S measures true A→B growth
   - Affects HSF (H3): S ~ V × F + controls (survivors only)
   - Bridge rounds → S confounded by multiple funding events
   - Wrong S measurement → spurious interaction effects
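
To make the step-up concern in item 2 concrete, here is a small worked sketch of S = PreMoney_t2 / PostMoney_t1. The valuations are invented for illustration (in millions), and `step_up` is a hypothetical helper, not a function from the scripts.

```python
def step_up(pre_money_t2: float, post_money_t1: float) -> float:
    """Step-up multiple: later-round pre-money over earlier-round post-money."""
    if post_money_t1 <= 0:
        raise ValueError("post_money_t1 must be positive")
    return pre_money_t2 / post_money_t1


# Hypothetical valuations in millions.
series_a_post = 20.0
series_b_pre = 50.0
bridge_post = 30.0  # post-money of an intermediate bridge round, if one occurred

s_measured = step_up(series_b_pre, series_a_post)  # what the A-to-B formula reports
s_last_round = step_up(series_b_pre, bridge_post)  # step-up over the bridge round only

print(f"S from Series A post-money: {s_measured:.2f}")
print(f"S from bridge post-money:   {s_last_round:.2f}")
```

If a bridge round sits between the two snapshots, `s_measured` spans two pricing events while `s_last_round` reflects only the final one, so the two values can diverge.
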
---
## Expected Findings (Pre-Analysis Hypotheses)
### verify_EL_definition.py:
- **Predicted:** 30,000-40,000 fast-growing companies excluded
- **Predicted:** Excluded companies have HIGHER L rates (15-25% vs 3-5%)
- **Implication:** Current narrow definition creates downward bias in E→L effects
### test_consecutive_rounds.py:
- **Predicted:** 40-60% of E→L companies had bridge rounds
- **Predicted:** Round count data available (60% probability)
- **Implication:** H3 (step-up) analysis needs censoring or additional controls
---
## Reporting to Advisors
When presenting these diagnostic findings, include:
1. **Tables:**
   - E/L classification comparison (narrow vs broad)
   - Transition matrices (baseline → endpoint)
   - Round count distributions
2. **Key Metrics:**
   - N excluded by narrow definition
   - L rates by group
   - % with consecutive rounds
   - Median days between funding
3. **Methodological Decisions:**
   - Which E/L definition chosen and why
   - How bridge rounds handled in H3 analysis
   - Robustness checks planned
4. **Limitations Acknowledged:**
   - Companies may have had funding before entering PitchBook
   - "LastFinancing" only shows most recent, not full history
   - Timing based on deal dates, not actual cash-in-bank dates
---
## Notes
- Both scripts auto-detect pipe (`|`) or tab (`\t`) delimiters
- Runtime: ~30-60 seconds per script on 400k+ companies
- Memory usage: ~2GB for full dataset
- Compatible with Python 3.8+
- No external dependencies beyond pandas, numpy
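
One simple way the delimiter auto-detection mentioned in the first note could work is to count candidate separators in the header line. This is a sketch under that assumption: `detect_delimiter` and `read_snapshot` are hypothetical names, `CompanyID` is a made-up column, and the scripts may use a different approach.

```python
import io

import pandas as pd


def detect_delimiter(header_line: str) -> str:
    """Pick pipe or tab, whichever appears more often in the header line."""
    return "|" if header_line.count("|") >= header_line.count("\t") else "\t"


def read_snapshot(text: str) -> pd.DataFrame:
    """Parse snapshot text with the detected delimiter, keeping all columns as strings."""
    sep = detect_delimiter(text.splitlines()[0])
    return pd.read_csv(io.StringIO(text), sep=sep, dtype=str)


# Toy snapshots, invented for illustration.
pipe_text = "CompanyID|LastFinancingDealType\n1|Series A\n2|Series B\n"
tab_text = "CompanyID\tLastFinancingDealType\n1\tSeries A\n"

pipe_df = read_snapshot(pipe_text)
tab_df = read_snapshot(tab_text)
print(pipe_df)
print(tab_df)
```
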
---
**Last Updated:** 2025-11-12
**Author:** Claude (with human validation)
**Related:** See `modules/features.py` for production classification logic