# Usage Examples
---
## ⚠️ DEPRECATION NOTICE
**These examples are for the OLD pipeline (now archived).**
**Please use**: `../../run_h2_seriesb.py`
**See**: `../../README_START_HERE.md` for current usage
**Old files archived to**: `../../archive_deprecated/`
---
## 1. Quick Start with Demo Data (OLD - Don't Use)
```bash
# Run complete pipeline with simulated data
python run_pipeline.py --demo
```
This will:
- Generate 300 simulated companies
- Compute vagueness and integration cost
- Run H1 and H2 hypothesis tests
- Create all diagnostic visualizations
- Save results to `output/`
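To confirm a demo run completed, you can list what it wrote to `output/` (the filenames referenced in section 8, e.g. `pb_processed_dataset.nc` and `model_results.nc`, should appear):
```python
from pathlib import Path

# List everything the demo run wrote to output/
for f in sorted(Path("output").iterdir()):
    print(f"{f.name} ({f.stat().st_size:,} bytes)")
```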
## 2. Run with Real PitchBook Data
```bash
# Assuming you have exported PitchBook data to CSV
python run_pipeline.py --data /path/to/pb_company_export.csv --output results/
```
Expected CSV format (pipe-delimited `|`):
```
CompanyID|CompanyName|Description|Keywords|FirstFinancingSize|LastFinancingDealType|Employees|YearFounded|TotalRaised
12345|Acme AI|We provide approximately scalable AI solutions...|software, AI, cloud|5000000|Series B|50|2020|15000000
```
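To inspect an export before running the pipeline, pandas can read the pipe-delimited file directly; a quick sketch using the columns from the header above:
```python
import pandas as pd

# PitchBook exports here are pipe-delimited, not comma-delimited
df = pd.read_csv("pb_company_export.csv", sep="|")

# Sanity-check the columns the pipeline uses downstream
print(df[["CompanyID", "Description", "FirstFinancingSize"]].head())
print(f"{df['Description'].isna().sum()} companies missing a description")
```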
## 3. Programmatic Usage
```python
from run_pipeline import HypothesisTestingPipeline
import pandas as pd
# Initialize pipeline
pipeline = HypothesisTestingPipeline(
    data_path="data/my_companies.csv",
    output_dir="my_results/"
)
# Run all steps
pipeline.run_full_pipeline()
# Or run steps individually
pipeline.step_1_load_data()
pipeline.step_2_engineer_features()
# Access intermediate results
engineered_df = pipeline.df
print(engineered_df[['vagueness', 'high_integration_cost', 'early_funding_musd']].head())
# Continue with analysis
pipeline.step_3_create_xarray_dataset()
pipeline.step_4_run_hypothesis_tests()
# Access model results
h1_model = pipeline.results['h1']
h2_model = pipeline.results['h2']
print(h1_model.summary())
# Create visualizations
pipeline.step_5_create_visualizations()
pipeline.step_6_save_outputs()
```
## 4. Custom Analysis
### Custom Formulas
```python
from hypothesis_tests import run_full_hypothesis_tests
import pandas as pd

# Load your data
df = pd.read_csv("processed_data.csv")

# Run with custom formulas
results = run_full_hypothesis_tests(
    df,
    h1_formula="early_funding_musd ~ vagueness + employees_log + year_founded + total_raised",
    h2_formula="later_success ~ vagueness * high_integration_cost + early_funding_musd + firm_age + total_raised"
)
# Access results
print("H1 vagueness coefficient:", results['h1'].params['vagueness'])
print("H2 interaction term:", results['h2'].params['vagueness:high_integration_cost'])
```
### Using xarray for Subsetting
```python
import xarray as xr
# Load processed dataset
ds = xr.open_dataset("output/pb_processed_dataset.nc")
# Filter to high-vagueness companies
high_vague = ds.where(ds['vagueness'] > 0.5, drop=True)
print(f"High vagueness companies: {high_vague.dims['company_id']}")
# Group by sector (if available)
if 'sector' in ds.coords:
sector_means = ds.groupby('sector').mean('company_id')
print(sector_means['vagueness'])
# Select specific companies
company_subset = ds.sel(company_id=[0, 1, 2, 3, 4])
# Convert to pandas for further analysis
df_subset = company_subset.to_dataframe().reset_index()
```
## 5. Feature Engineering Only
```python
from feature_engineering import engineer_features, compute_vagueness
import pandas as pd

# Apply to your own dataframe
df = pd.read_csv("my_companies.csv")
df_engineered = engineer_features(df)
# Or compute individual features
vagueness_scores = df['Description'].apply(compute_vagueness)
```
## 6. Visualization Only
```python
from visualizations import create_all_visualizations
import pandas as pd
import pickle

# Load saved model results
with open("output/model_results.pkl", "rb") as f:
    results = pickle.load(f)

# Load data
df = pd.read_csv("processed_panel.csv")

# Create all plots
created_files = create_all_visualizations(
    df,
    results,
    output_dir="my_plots/"
)
print("Created:", created_files.keys())
```
## 7. Interpreting Results
### H1: Early Funding Effect
**Expected:** α₁ < 0 (vagueness hurts early funding)
```python
h1_model = results['h1']
vagueness_coef = h1_model.params['vagueness']
p_value = h1_model.pvalues['vagueness']
if vagueness_coef < 0 and p_value < 0.05:
    print("✓ H1 supported: Vague descriptions reduce early funding")
    print(f"  Effect size: ${vagueness_coef:.2f}M per unit vagueness")
else:
    print("✗ H1 not supported")
```
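A point estimate and p-value alone understate uncertainty; statsmodels also exposes confidence intervals, which are worth reporting alongside the effect size:
```python
# 95% confidence interval for the vagueness coefficient
ci_low, ci_high = h1_model.conf_int().loc['vagueness']
print(f"Vagueness effect: {vagueness_coef:.2f}M [95% CI {ci_low:.2f}, {ci_high:.2f}]")
```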
### H2: Later Success with Moderation
**Expected:** β₁ > 0 (helps in modular), β₃ < 0 (hurts in integrated)
```python
h2_model = results['h2']
beta1 = h2_model.params['vagueness']
beta3 = h2_model.params['vagueness:high_integration_cost']
# Effect in modular sectors (software, APIs)
modular_effect = beta1
print(f"Modular sector effect: {modular_effect:.4f}")
# Effect in integrated sectors (hardware, robotics)
integrated_effect = beta1 + beta3
print(f"Integrated sector effect: {integrated_effect:.4f}")
# Interaction magnitude
print(f"Moderation effect: {beta3:.4f}")
if beta1 > 0 and beta3 < 0:
    print("✓ H2 supported: Vagueness helps in modular, hurts in integrated")
```
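Because `integrated_effect` is the sum of two estimated coefficients, its standard error must come from the coefficient covariance matrix, not from either p-value alone. A sketch, continuing from the block above:
```python
import numpy as np

# Var(b1 + b3) = Var(b1) + Var(b3) + 2*Cov(b1, b3)
cov = h2_model.cov_params()
v = 'vagueness'
vx = 'vagueness:high_integration_cost'
var_sum = cov.loc[v, v] + cov.loc[vx, vx] + 2 * cov.loc[v, vx]
print(f"Integrated sector effect: {integrated_effect:.4f} (SE {np.sqrt(var_sum):.4f})")
```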
## 8. Output Files
### xarray Datasets (`.nc`)
```python
import xarray as xr
# Load processed data
ds = xr.open_dataset("output/pb_processed_dataset.nc")
print(ds)
# Load model results
results_ds = xr.open_dataset("output/model_results.nc")
print("H1 coefficients:", results_ds['h1_coef'].values)
print("H1 p-values:", results_ds['h1_pval'].values)
```
### CSV Tables
```python
import pandas as pd
# Summary table
summary = pd.read_csv("output/hypothesis_test_summary.csv")
print(summary)
# Detailed coefficients
h1_table = pd.read_csv("output/h1_coefficients.csv")
print(h1_table)
```
## 9. Common Issues
### Missing Data
```python
# Check for missing values before analysis
df_clean = df.dropna(subset=['vagueness', 'early_funding_musd', 'later_success'])
print(f"Dropped {len(df) - len(df_clean)} rows with missing data")
```
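Before dropping rows, it can help to see which columns drive the loss:
```python
# Per-column missingness among the analysis variables
cols = ['vagueness', 'early_funding_musd', 'later_success']
print(df[cols].isna().sum())
```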
### Integration Cost Classification
```python
from feature_engineering import classify_integration_cost
# Custom classification
df['integration_cost'] = df.apply(
    lambda row: classify_integration_cost(row['Keywords'], row['Description']),
    axis=1
)
# Verify distribution
print(df['integration_cost'].value_counts())
```
### Vagueness Scoring
```python
from feature_engineering import compute_vagueness
# Test on sample text
text = "We provide approximately scalable AI solutions for flexible deployment"
score = compute_vagueness(text)
print(f"Vagueness score: {score:.3f}")
# Custom hedge words (optional enhancement)
# Modify hedge_words list in feature_engineering.py
```
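If you would rather not edit `feature_engineering.py`, a standalone scorer with your own hedge list is straightforward. This is an illustrative sketch (the hypothetical `custom_vagueness` below is not the pipeline's actual scoring function):
```python
import re

def custom_vagueness(text, hedge_words):
    """Fraction of tokens that appear in a custom hedge list (illustrative)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in hedge_words for t in tokens) / len(tokens)

my_hedges = {"approximately", "scalable", "flexible", "various", "solutions"}
print(custom_vagueness("We provide approximately scalable AI solutions", my_hedges))
```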
## 10. Extending the Pipeline
### Add New Controls
```python
# In your script
df['new_control'] = ... # Add your custom variable
# Update H1 formula
results = run_full_hypothesis_tests(
    df,
    h1_formula="early_funding_musd ~ vagueness + employees_log + year_founded + new_control"
)
```
### Add New Hypotheses
```python
# Create H3 model
import statsmodels.formula.api as smf
model_h3 = smf.ols(
    "total_raised ~ vagueness * high_integration_cost + early_funding_musd",
    data=df
).fit()
print(model_h3.summary())
```
## 11. Performance Tips
For large datasets (>10,000 companies):
```python
import pandas as pd
from feature_engineering import engineer_features

# Process the file in chunks to limit memory use
chunks = []
for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    chunk_engineered = engineer_features(chunk)
    chunks.append(chunk_engineered)
df = pd.concat(chunks, ignore_index=True)
```
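If `engineer_features` is a pure function, the chunks can also be processed in parallel; a sketch assuming the function and chunks are picklable:
```python
import pandas as pd
from concurrent.futures import ProcessPoolExecutor
from feature_engineering import engineer_features

if __name__ == "__main__":  # required for process pools on Windows/macOS
    chunks = pd.read_csv("large_file.csv", chunksize=1000)
    with ProcessPoolExecutor() as pool:
        engineered = list(pool.map(engineer_features, chunks))
    df = pd.concat(engineered, ignore_index=True)
```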
## 12. Reproducibility
```python
# Always set random seed for simulated data
import numpy as np
np.random.seed(42)
# Record pipeline version
from run_pipeline import HypothesisTestingPipeline
pipeline = HypothesisTestingPipeline(...)
print("Pipeline version:", pipeline.metadata['pipeline_version'])
```
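Recording library versions alongside saved results also helps with reproducibility; a minimal sketch:
```python
import sys
import numpy as np
import pandas as pd
import xarray as xr

# Capture the environment that produced the results
print("Python:", sys.version.split()[0])
print("numpy:", np.__version__, "| pandas:", pd.__version__, "| xarray:", xr.__version__)
```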