# W1-Empirical: Data Acquisition Progress
**Send date**: Tuesday, October 29, 2024 (evening, 6-8pm)
**Subject**: [Empirical] Week 1: Data Acquisition
Subject: Week 1 Progress: Analysis Pipeline & Key Design Decisions
Dear Charlie and Scott,
I've attached the `results_validation.qmd` for my QE paper, which documents the complete Week 1 progress.
I have successfully built the end-to-end analysis pipeline: I've unified the three PitchBook snapshots (2021-2023) and implemented the primary models (H1-OLS for funding, H2-Logit for survival).
The attached notebook (Section 3.2) details the key methodological decisions and their rationales:
1. **Vagueness Measure:** I am using an LLM-based generality score as the primary measure. It captures semantic breadth and validated well (κ=0.78), while LIWC is used for robustness.
2. **H2 Mediator:** I am excluding `early_funding` as a control in the main H2 model. The causal rationale is to capture the _total effect_ of vagueness on survival, not just the direct path. An alternative SEM decomposing the effects is in Appendix B.
3. **Survival Window:** I've set an 18-month threshold from the 2023 snapshot. This balances post-pandemic dynamics with statistical power (N=4,237 survived vs. 1,863 exited), and sensitivity checks are prepared.
**Data Note:** This validation uses data from an HBS colleague. Please advise if I should now transition to MIT's institutional PitchBook license for the final run.
The full code is available on GitHub for replication: [Insert GitHub Link], and the tech spec is in [[W1-tech_spec]].
My plan for Week 2 is to incorporate your feedback on these decisions, run the complete analysis, and generate the final tables for our next meeting.
P.S. As promised, I'm also sharing the three foundational papers we discussed. My core idea is to build on them by treating promise variance as an _endogenous strategic choice_. This bridges the gap between the empirical findings on vagueness (El-Zayaty et al.) and the theoretical models on survival (Archibald et al.) and pivot costs (McDonald & Gao).
Thank you,
[Your Name]
## 1. .qmd Outline (Based on Week 1)
```markdown
---
title: "Promise Precision and Venture Funding"
subtitle: "Week 1 Report: Data Architecture & H2 Design"
author: "Hyunji Moon"
date: "October 27, 2025"
---
# 1. Executive Summary
- Research question
- Week 1 deliverables
- Key findings
# 2. Theoretical Background
## 2.1 Strategic Ambiguity Framework
- El-Zayaty et al. (empirical vagueness findings)
- Archibald et al. (survival models)
- McDonald & Gao (pivot costs)
## 2.2 H2 Mechanism: Commitment vs Flexibility
- Scott Stern framework (Yet Ming vs Bob Langer)
- Series B as timed gate
- Theoretical foundations (Leonard-Barton 1992, Gompers 1995)
# 3. Data & Methods
## 3.1 PitchBook Snapshot Structure
- 4 snapshots (420K→504K cumulative DB)
- Data leakage issue & as-of capping
## 3.2 H2 DV Design
- Why Series B+ progression (not activity)
- At-risk cohort (Series A, VC-backed)
- Expected base rate: 12-15%
## 3.3 M&A Treatment
- Competing risk (censored primary)
- Robustness bounds
# 4. Week 1 Results
## 4.1 Diagnostic Findings
- 98% survival problem diagnosis
- Top 3 critical issues
## 4.2 Pipeline Status
- Scripts 01-05 complete
- H1 preliminary (vagueness → early funding)
- H2 redesign in progress
# 5. Next Steps (Week 2)
- Implement as-of capping
- Run H2 primary + robustness
- Tables 1-2 delivery
# Appendix A: Expert Validations
- ChatGPT DV spec
- Gemini strategic guidance
# Appendix B: Related Work (A-journals)
- [Attach Document 10 content]
```
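The "as-of capping" item in Section 3.1 of the outline could be sketched roughly as follows. This is a minimal illustration, assuming that capping means dropping any deal record dated after a snapshot's as-of date so that later information cannot leak into earlier snapshots. The field names (`company_id`, `deal_type`, `deal_date`) and the sample rows are hypothetical, not PitchBook's actual export columns or data.

```python
from datetime import date


def cap_as_of(deals, as_of):
    """Keep only deals dated on or before the snapshot's as-of date.

    Assumption: each deal is a dict with a `deal_date` key (hypothetical name).
    """
    return [d for d in deals if d["deal_date"] <= as_of]


# Made-up rows, for illustration only.
deals = [
    {"company_id": "c1", "deal_type": "Series A", "deal_date": date(2021, 6, 1)},
    {"company_id": "c1", "deal_type": "Series B", "deal_date": date(2023, 3, 15)},
    {"company_id": "c2", "deal_type": "Series A", "deal_date": date(2022, 1, 10)},
]

# Viewing the data "as of" the end of 2021 hides both later deals.
capped = cap_as_of(deals, as_of=date(2021, 12, 31))
```

Applying the same cap to every predictor before fitting H2 keeps the outcome window strictly after the information used to predict it.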
---
## 2. Theoretical Background
### 2.1 This Week (Week 1)
**Endogenous Strategic Choice Framework**
Research gap:
- **Empirical**: El-Zayaty et al.'s vagueness measure
- **Theory**: Archibald et al. survival models + McDonald & Gao pivot costs
- **Bridge**: Reinterpret vagueness as an **endogenous strategic choice**
H2 contribution: The interaction between integration cost (a pivot cost proxy) and vagueness identifies the conditions for **when** ambiguity is valuable.
### 2.2 Placement for Next Week (Document 10)
**Section 2.2.1: Why Series B Gate?** (before Methods)
1. **Commitment→Rigidity** (Leonard-Barton 1992, Sanchez 1995)
2. **Survival≠Success** (Caballero et al. 2008, Coad et al. 2013)
3. **VC Staging Logic** (Gompers 1995, Kaplan & Strömberg 2003)
→ Justifies the "17-month B progression" DV.
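A progression DV of this kind might be coded along the following lines. This is only a sketch under stated assumptions: the window is counted in calendar months from a reference date (`start`, which could be the Series A close or a snapshot date, depending on the final design), and a firm scores 1 only if a Series B date falls inside that window. The function names and the 17-month default are illustrative and do not come from the actual pipeline.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date forward by whole calendar months, clamping the day."""
    total = d.month - 1 + months
    year = d.year + total // 12
    month = total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reached_series_b(start, series_b_date, window_months=17):
    """Return 1 if Series B closed within the window after `start`, else 0.

    Assumption: firms with no recorded Series B (None) are coded 0.
    """
    if series_b_date is None:
        return 0
    deadline = add_months(start, window_months)
    return int(start < series_b_date <= deadline)
```

For example, `reached_series_b(date(2022, 1, 15), date(2023, 6, 15))` falls exactly on the deadline and counts as progression, while a Series B one day later does not.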
**Saved:**
9️⃣ **QMD Structure**
- Section 2.2: H2 mechanism + A-journal foundations
- Section 3.2: DV design rationale (theory-driven)
---
Dear Charlie and Scott,
**Progress This Week:**
✅ **Data Source Confirmed**
- Secured access to a comprehensive private capital database (PitchBook)
- Sample frame identified: 60-80 AI/ML infrastructure firms
- Time window: 2021-22 Series A cohort → 2023-25 Series B outcomes
- Domain rationale: Clean natural experiment (boom → shakeout period)
✅ **Variable Design Complete**
*Independent variable:*
- **Vagueness**: Text analysis of founder communications using certitude scoring (following El-Zayaty et al. 2025 methodology)
- Operationalization: 100 - LIWC certitude score on company descriptions
*Moderator:*
- **Integration cost**: Hardware/chip/robotics firms (high) vs. software/API firms (low)
- Classification via keyword presence in company descriptions and technology tags
*Outcome variable:*
- **Funding success**: DealSize > 0 at both Series A and Series B stages
- Panel structure: Each firm observed twice (A and B rounds)
✅ **Domain Selection Justified**
- **Why AI/ML, not EV/AV**:
  - AI/ML has true moderator variance (hardware vs. software firms coexist)
  - EV/AV lacks a software-only comparison group
- 2021-22 boom → 2023-25 adjustment provides natural test of Fine's "era of ferment" dynamics
⏳ **In Progress**
- Data extraction from PitchBook (company descriptions + deal records)
- VCRound identification logic: Using DealType filters + deal size thresholds + temporal sequencing
- Manual validation sample (20% random check for ambiguous cases)
---
**Next Week Target:**
Complete dataset construction, generate descriptive statistics (Table 1).
Best,
Angie
---
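The variable design in the draft above can be illustrated with a short sketch. It assumes a LIWC certitude score has already been computed for each company description (LIWC itself is not called here), and it uses the three keywords named in the draft as a stand-in for the full keyword list. All function names are hypothetical.

```python
# Keywords taken from the draft's "high integration cost" examples.
# Assumption: the real classification list may be longer.
HIGH_COST_KEYWORDS = ("hardware", "chip", "robotics")


def vagueness(certitude_score):
    """Vagueness = 100 - LIWC certitude score (score computed elsewhere)."""
    return 100.0 - certitude_score


def integration_cost(description, tags=()):
    """Label a firm 'high' if any keyword appears in its description or tags.

    Plain substring matching is used here, so 'chips' also matches 'chip'.
    """
    text = " ".join([description, *tags]).lower()
    return "high" if any(k in text for k in HIGH_COST_KEYWORDS) else "low"


def funding_success(deal_size):
    """Per-round indicator: 1 if DealSize > 0, else 0 (missing counts as 0)."""
    return int(deal_size is not None and deal_size > 0)
```

Under the panel structure described in the draft, `funding_success` would be evaluated once for the Series A round and once for the Series B round of each firm.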
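## Writing Guide (for when you fill this in)
**Parts to adjust to match actual progress:**
1. **Sample size**: "60-80 firms" → update to the number actually secured
2. **Data access**: "Secured access" → if access is not yet in place, use "Finalizing access to..."
3. **In Progress section**: be as honest as possible; list only what you are actually doing now
4. **Next Week Target**: only realistically achievable goals
**Keep the tone:**
- Short and factual (200 words or fewer)
- Clearly separate completed (✅) from in progress (⏳)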
- No excuses, just facts
- Confident but not overconfident
**If progress is slower than expected:**
```
⏳ **In Progress**
- Data access negotiations ongoing (PitchBook licensing)
- Alternative sources under evaluation (Crunchbase, manual collection)
- Variable operationalization refined based on data availability
**Revised Target:**
Secure data access by next week, begin extraction.
```
Be honest, but keep a forward-looking tone.