# Hypothesis Testing Pipeline for Promise Vagueness Research
---
## β οΈ DEPRECATION NOTICE
**This documentation is for the OLD pipeline (now archived).**
**Please use the NEW pipeline instead**: `../../run_h2_seriesb.py`
**Documentation**: See `../../README_START_HERE.md`
**What changed**:
- OLD: 2-snapshot approach (singular matrix error)
- NEW: 4-snapshot approach (12-15% progression rate) β
**Old files archived to**: `../../archive_deprecated/`
---
## π― Research Objective (Historical Reference Only)
Test how **promise vagueness** affects (1) early funding amounts and (2) later success, conditional on sector integration cost.
## π Hypotheses
| ID | Relationship | Expected Sign | Model Type |
|----|---------------|---------------|-------------|
| **H1** | Early Funding ~ Vagueness + Controls | Negative (Ξ±β < 0) | OLS |
| **H2** | Later Success ~ Vagueness Γ Integration Cost + Early Funding + Controls | Positive in modular (Ξ²β > 0); attenuated in integrated (Ξ²β < 0) | Logit |
### H1: Early Funding Effect
**Model:** `log(Early Funding) ~ Vagueness + Employees + Year Founded`
**Expected:** Ξ±β < 0 (vague promises lead to lower early funding)
**Interpretation:** Investors discount vague promises when deciding initial funding amounts.
### H2: Later Success Effect with Moderation
**Model:** `Later Success ~ Vagueness Γ Integration Cost + Early Funding + Employees + Year Founded`
**Expected:**
- Ξ²β > 0 (vagueness helps in modular sectors)
- Ξ²β < 0 (effect attenuated in integrated sectors)
**Interpretation:** Vagueness provides strategic flexibility that helps in modular sectors (software, APIs) but hurts in integrated sectors (hardware, robotics).
---
## π Quick Start
### Installation
```bash
# Install required packages
pip install -r requirements.txt
```
### Run with Demo Data
```bash
# Generate and run with simulated data
python run_pipeline.py --demo
```
### Run with Real Data
```bash
# Run with PitchBook CSV export
python run_pipeline.py --data path/to/pb_company_raw.csv --output results/
```
---
## π Pipeline Structure
```
hypothesis_testing_pipeline/
βββ run_pipeline.py # Main orchestration script
βββ src/
β βββ feature_engineering.py # Vagueness & integration cost computation
β βββ hypothesis_tests.py # H1 (OLS) & H2 (Logit) models
β βββ visualizations.py # Diagnostic plots
βββ data/ # Input data directory
βββ output/ # Results and deliverables
β βββ pb_processed_dataset.nc # xarray dataset
β βββ model_results.nc # Model coefficients (xarray)
β βββ hypothesis_test_summary.csv # Test results summary
β βββ h1_coefficients.csv # H1 detailed results
β βββ h2_coefficients.csv # H2 detailed results
β βββ h1_scatter.png # H1 visualization
β βββ h1_diagnostics.png # H1 residual plots
β βββ h2_interaction.png # H2 interaction plot
β βββ coefficient_comparison.png # Summary plot
βββ README.md
```
---
## π§ Pipeline Steps
### Step 1: Load Data
Reads PitchBook CSV export with pipe-delimited format:
- Company information
- Descriptions and keywords
- Financing data
- Employee counts
### Step 2: Feature Engineering
Computes derived variables:
#### **Vagueness Score** (0-1 scale)
Uses LIWC-style keyword counting:
```python
hedge_words = ["maybe", "approximately", "likely", "potential", "scalable", "flexible"]
vagueness = count(hedge_words) / total_words
```
#### **Integration Cost Classification**
- **High Integration (1):** hardware, robotics, chips, biotech
- **Low Integration (0):** software, APIs, SaaS, cloud
#### **Funding Variables**
- `early_funding_musd`: First financing size in millions
- `later_success`: Binary indicator for reaching Series B+
#### **Controls**
- `employees_log`: Log-transformed employee count
- `year_founded`: Founding year
- `firm_age`: Current age of firm
### Step 3: Create xarray Dataset
Converts engineered features to xarray format with:
- Dimensions: company_id
- Coordinates: sector, year
- Data variables: all analysis features
- Attributes: metadata and provenance
### Step 4: Run Hypothesis Tests
#### H1: OLS Regression
```python
model1 = ols("early_funding_musd ~ vagueness + employees_log + year_founded")
```
**Test:** Is Ξ±β < 0?
#### H2: Logit Regression
```python
model2 = logit("later_success ~ vagueness * high_integration_cost + early_funding_musd + employees_log + year_founded")
```
**Tests:**
- Is Ξ²β > 0? (main effect in modular)
- Is Ξ²β < 0? (interaction term)
### Step 5: Create Visualizations
Generates diagnostic plots:
1. **H1 Scatter Plot:** Early funding vs vagueness with regression line
2. **H1 Diagnostics:** Residual plots (4-panel)
3. **H2 Interaction Plot:** Predicted probability curves by integration cost
4. **H2 Marginal Effects:** Success rates across vagueness bins
5. **Coefficient Comparison:** Forest plot of key estimates
### Step 6: Save Outputs
Exports results in multiple formats:
- **xarray NetCDF** (`.nc`): Efficient binary storage with metadata
- **CSV tables**: Human-readable coefficients and summaries
- **PNG plots**: High-resolution diagnostics (300 DPI)
- **JSON metadata**: Pipeline provenance and settings
---
## π Output Files
### Primary Deliverables
| File | Description |
|------|-------------|
| `pb_processed_dataset.nc` | xarray Dataset with all engineered features |
| `model_results.nc` | xarray Dataset with model coefficients and statistics |
| `hypothesis_test_summary.csv` | Summary table comparing estimated vs expected signs |
### Detailed Results
| File | Description |
|------|-------------|
| `h1_coefficients.csv` | H1 OLS coefficients, standard errors, p-values, CIs |
| `h2_coefficients.csv` | H2 Logit coefficients, standard errors, p-values, CIs |
| `pipeline_metadata.json` | Pipeline version, timestamps, data source |
### Visualizations
| File | Description |
|------|-------------|
| `h1_scatter.png` | Early funding vs vagueness with fitted line |
| `h1_diagnostics.png` | 4-panel residual diagnostics |
| `h2_interaction.png` | Predicted probability curves by integration cost |
| `h2_marginal_effects.png` | Success rates by vagueness bins |
| `coefficient_comparison.png` | Forest plot of key hypothesis tests |
---
## π¬ Expected Results
### H1: Early Funding
**Expected:** Ξ±β < 0 (vagueness hurts early funding)
**Interpretation:**
- More vague descriptions β Lower Series A amounts
- Investors demand clarity for initial commitments
### H2: Later Success
**Expected:** Ξ²β > 0, Ξ²β < 0 (vagueness helps in modular, hurts in integrated)
**Interpretation:**
- In modular sectors (software): Vagueness provides strategic flexibility β Higher Series B probability
- In integrated sectors (hardware): Vagueness signals uncertainty β Lower Series B probability
- **Interaction effect:** Ξ²β < 0 confirms moderation by integration cost
---
## π οΈ Advanced Usage
### Custom Formulas
```python
from hypothesis_tests import run_full_hypothesis_tests
# Custom H1 formula with additional controls
results = run_full_hypothesis_tests(
df,
h1_formula="early_funding_musd ~ vagueness + employees_log + year_founded + total_raised",
h2_formula="later_success ~ vagueness * high_integration_cost + early_funding_musd + firm_age"
)
```
### xarray Operations
```python
import xarray as xr
# Load processed dataset
ds = xr.open_dataset("output/pb_processed_dataset.nc")
# Filter by sector
software_firms = ds.where(ds['high_integration_cost'] == 0, drop=True)
# Group by year
annual_vagueness = ds.groupby('year').mean('company_id')
# Save subset
software_firms.to_netcdf("output/software_firms.nc")
```
### Programmatic Access
```python
from run_pipeline import HypothesisTestingPipeline
# Initialize
pipeline = HypothesisTestingPipeline(
data_path="data/pb_company.csv",
output_dir="my_results/"
)
# Run specific steps
pipeline.step_1_load_data()
pipeline.step_2_engineer_features()
pipeline.step_3_create_xarray_dataset()
# Access intermediate results
df_engineered = pipeline.df
xr_dataset = pipeline.ds
# Continue pipeline
pipeline.step_4_run_hypothesis_tests()
pipeline.step_5_create_visualizations()
pipeline.step_6_save_outputs()
```
---
## π Input Data Format
Expected CSV columns (PitchBook export format):
| Column | Type | Description |
|--------|------|-------------|
| `CompanyID` | int | Unique company identifier |
| `CompanyName` | str | Company name |
| `Description` | str | Company description (for vagueness scoring) |
| `Keywords` | str | Industry keywords (for integration cost) |
| `FirstFinancingSize` | float | First financing amount in USD |
| `LastFinancingDealType` | str | Last deal type (e.g., "Series B") |
| `Employees` | int | Employee count |
| `YearFounded` | int | Year company was founded |
| `TotalRaised` | float | Total capital raised in USD |
*Note: Column names are case-insensitive and support both PascalCase and snake_case variants.*
---
## π§ͺ Testing
### Unit Tests
```bash
# Test feature engineering
python src/feature_engineering.py
# Test hypothesis tests (with simulated data)
python src/hypothesis_tests.py
# Test visualizations
python src/visualizations.py
```
### Integration Test
```bash
# Full pipeline with demo data
python run_pipeline.py --demo
```
---
## π Dependencies
See `requirements.txt`:
- `pandas >= 1.5.0` - Data manipulation
- `numpy >= 1.24.0` - Numerical operations
- `xarray >= 2023.1.0` - Multi-dimensional arrays
- `statsmodels >= 0.14.0` - Statistical models
- `matplotlib >= 3.7.0` - Plotting
- `seaborn >= 0.12.0` - Statistical visualization
---
## π€ Contributing
This pipeline follows modular design principles:
- **Separation of concerns:** Each module handles one responsibility
- **Type hints:** All functions include type annotations
- **Docstrings:** Comprehensive documentation with examples
- **Error handling:** Graceful handling of missing data
- **Reproducibility:** Random seeds and metadata tracking
---
## π Citation
If you use this pipeline in your research, please cite:
```bibtex
@software{hypothesis_testing_pipeline,
title = {Hypothesis Testing Pipeline for Promise Vagueness Research},
author = {[Your Name]},
year = {2024},
url = {https://github.com/[your-repo]/hypothesis_testing_pipeline}
}
```
---
## π§ Support
For issues or questions:
- Open an issue on GitHub
- Review existing documentation in this README
- Check function docstrings in source code
---
## π Version History
- **v1.0.0** (2024-10-27): Initial release
- H1 and H2 hypothesis tests
- xarray-based data management
- Comprehensive visualization suite
- Modular architecture
---
## βοΈ License
MIT License - see LICENSE file for details