Waterfall enrich HubSpot contacts across Apollo, Clearbit, and PDL using code
Prerequisites
- Node.js 18+ or Python 3.9+
- HubSpot private app token with crm.objects.contacts.read and crm.objects.contacts.write scopes
- Apollo API key (Settings → Integrations → API)
- Clearbit API key (API → API Keys in dashboard)
- People Data Labs API key (from PDL dashboard)
- A scheduling environment: cron or GitHub Actions
Step 1: Set up the project
```bash
# Test each API key (assumes the keys are exported as environment variables)
curl -X POST "https://api.apollo.io/api/v1/people/match" \
  -H "x-api-key: $APOLLO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'

curl "https://person.clearbit.com/v2/people/find?email=test@example.com" \
  -H "Authorization: Bearer $CLEARBIT_API_KEY"

curl -X POST "https://api.peopledatalabs.com/v5/person/enrich" \
  -H "x-api-key: $PDL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
```

Step 2: Build the enrichment providers
Define each provider as a function that accepts an email and returns a normalized result. This makes it easy to reorder or swap providers:
```python
import os
import time

import requests

APOLLO_API_KEY = os.environ["APOLLO_API_KEY"]
CLEARBIT_API_KEY = os.environ["CLEARBIT_API_KEY"]
PDL_API_KEY = os.environ["PDL_API_KEY"]

TIMEOUT = 30  # seconds; keeps one slow provider from hanging the whole run


def enrich_apollo(email):
    """Apollo People Match: returns normalized fields."""
    resp = requests.post(
        "https://api.apollo.io/api/v1/people/match",
        headers={"x-api-key": APOLLO_API_KEY, "Content-Type": "application/json"},
        json={"email": email},
        timeout=TIMEOUT,
    )
    resp.raise_for_status()
    person = resp.json().get("person")
    if not person:
        return {}
    # "organization" may be present but null, so fall back to an empty dict
    org = person.get("organization") or {}
    return {
        "jobtitle": person.get("title"),
        "company": org.get("name"),
        "phone": (person.get("phone_numbers") or [{}])[0].get("sanitized_number"),
        "linkedin_url": person.get("linkedin_url"),
        "industry": org.get("industry"),
        "seniority": person.get("seniority"),
    }


def enrich_clearbit(email):
    """Clearbit Person Enrichment: returns normalized fields."""
    resp = requests.get(
        "https://person.clearbit.com/v2/people/find",
        params={"email": email},  # lets requests URL-encode characters such as "+"
        headers={"Authorization": f"Bearer {CLEARBIT_API_KEY}"},
        timeout=TIMEOUT,
    )
    # 404 = no match; 202 = lookup still pending, so no data to use yet
    if resp.status_code in (202, 404):
        return {}
    resp.raise_for_status()
    data = resp.json()
    employment = data.get("employment") or {}
    handle = (data.get("linkedin") or {}).get("handle")
    if handle:
        # The handle may already carry an "in/" prefix; strip it to avoid doubling
        handle = handle.removeprefix("in/")
    return {
        "jobtitle": employment.get("title"),
        "company": employment.get("name"),
        "seniority": employment.get("seniority"),
        "linkedin_url": f"https://linkedin.com/in/{handle}" if handle else None,
    }


def enrich_pdl(email):
    """People Data Labs Enrichment: returns normalized fields."""
    resp = requests.post(
        "https://api.peopledatalabs.com/v5/person/enrich",
        headers={"x-api-key": PDL_API_KEY, "Content-Type": "application/json"},
        json={"email": email},
        timeout=TIMEOUT,
    )
    if resp.status_code == 404:
        return {}
    resp.raise_for_status()
    body = resp.json()
    data = body.get("data") or body  # fall back to the top level if "data" is absent
    phones = data.get("phone_numbers") or []
    return {
        "jobtitle": data.get("job_title"),
        "company": data.get("job_company_name"),
        "phone": phones[0] if phones else None,
        "linkedin_url": data.get("linkedin_url"),
        "industry": data.get("industry"),
    }
```

Step 3: Implement the waterfall logic
The core pattern: call the first provider, check which required fields are still empty, and call the next provider only if gaps remain. Each provider returns its full record, but later results only fill empty fields and never overwrite a value set by an earlier provider.
```python
REQUIRED_FIELDS = ["jobtitle", "company", "phone", "linkedin_url", "industry"]

PROVIDERS = [
    ("apollo", enrich_apollo),
    ("clearbit", enrich_clearbit),
    ("pdl", enrich_pdl),
]


def waterfall_enrich(email):
    """Call providers in sequence, stopping when all fields are filled."""
    merged = {}
    sources = []
    for name, enrich_fn in PROVIDERS:
        missing = [f for f in REQUIRED_FIELDS if not merged.get(f)]
        if not missing:
            break
        try:
            result = enrich_fn(email)
            filled_something = False
            for field, value in result.items():
                # Only fill gaps; never overwrite an earlier provider's value
                if value and not merged.get(field):
                    merged[field] = value
                    filled_something = True
            if filled_something:
                sources.append(name)
        except Exception as e:
            # A failing provider should not stop the waterfall
            print(f"  {name} failed for {email}: {e}")
        time.sleep(0.5)  # rate limit buffer between providers
    merged["enrichment_source"] = "+".join(sources) if sources else "none"
    return merged
```

The waterfall order is defined by the PROVIDERS list. To try Clearbit first, move its entry to index 0. No other code changes are needed.
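To see the gap-filling rule in isolation, the sketch below applies the same merge logic to two stub providers instead of live APIs. `fill_gaps`, `stub_first`, and `stub_second` are hypothetical names used only for illustration; they are not part of the script above.

```python
def fill_gaps(merged, result):
    """Copy values from result into merged only where merged has no value yet.

    Returns the list of field names that were filled by this call.
    """
    filled = []
    for field, value in result.items():
        if value and not merged.get(field):
            merged[field] = value
            filled.append(field)
    return filled


# Hypothetical stub providers standing in for real API calls.
def stub_first(email):
    return {"jobtitle": "VP Sales", "company": "Acme", "phone": None}


def stub_second(email):
    return {"jobtitle": "Head of Sales", "phone": "+15550100"}


merged = {}
first_filled = fill_gaps(merged, stub_first("jane@example.com"))
second_filled = fill_gaps(merged, stub_second("jane@example.com"))

print(merged)
print(first_filled, second_filled)
```

The second stub's job title is discarded because the first stub already supplied one; only the missing phone number is taken from it.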
Step 4: Fetch contacts and write results to HubSpot
```python
HUBSPOT_TOKEN = os.environ["HUBSPOT_TOKEN"]
HS_HEADERS = {"Authorization": f"Bearer {HUBSPOT_TOKEN}", "Content-Type": "application/json"}


def get_unenriched_contacts(limit=50):
    """Find contacts missing job title (first page of results only)."""
    resp = requests.post(
        "https://api.hubapi.com/crm/v3/objects/contacts/search",
        headers=HS_HEADERS,
        json={
            "filterGroups": [{"filters": [{
                "propertyName": "jobtitle",
                "operator": "NOT_HAS_PROPERTY"
            }]}],
            "properties": ["email", "jobtitle", "company", "phone"],
            "limit": limit
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]


def update_contact(contact_id, fields):
    """Write non-null fields to HubSpot."""
    # HubSpot rejects the update if a property does not exist, so custom
    # properties (for example enrichment_source) must be created first.
    properties = {k: v for k, v in fields.items() if v}
    if not properties:
        return
    resp = requests.patch(
        f"https://api.hubapi.com/crm/v3/objects/contacts/{contact_id}",
        headers=HS_HEADERS,
        json={"properties": properties},
        timeout=30,
    )
    resp.raise_for_status()


def main():
    contacts = get_unenriched_contacts()
    print(f"Found {len(contacts)} contacts to enrich")
    for contact in contacts:
        email = contact["properties"].get("email")
        if not email:
            continue
        print(f"Enriching {email}...")
        fields = waterfall_enrich(email)
        update_contact(contact["id"], fields)
        filled = [k for k, v in fields.items() if v and k != "enrichment_source"]
        print(f"  Source: {fields.get('enrichment_source')} | Fields: {filled}")
    print("Done.")


if __name__ == "__main__":
    main()
```

Step 5: Schedule the script
```yaml
# .github/workflows/waterfall-enrich.yml
name: Waterfall Enrichment
on:
  schedule:
    - cron: '0 */4 * * *'  # Every 4 hours
  workflow_dispatch: {}
jobs:
  enrich:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install requests
      - run: python enrich.py
        env:
          HUBSPOT_TOKEN: ${{ secrets.HUBSPOT_TOKEN }}
          APOLLO_API_KEY: ${{ secrets.APOLLO_API_KEY }}
          CLEARBIT_API_KEY: ${{ secrets.CLEARBIT_API_KEY }}
          PDL_API_KEY: ${{ secrets.PDL_API_KEY }}
```

Rate limits
| API | Limit | Delay |
|---|---|---|
| Apollo People Match | 5 req/sec (Basic) | 500ms between calls |
| Clearbit Person | 600 req/min | Unlikely to hit |
| People Data Labs | 10 req/sec | 500ms between calls |
| HubSpot Search | 5 req/sec | 200ms between pages |
| HubSpot PATCH | 150 req/10 sec | No delay needed |
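The 200ms delay between HubSpot Search pages only matters once the search is paginated, which the script in Step 4 does not do (it reads a single page). A minimal sketch of cursor-based paging with a pause between pages might look like the following. `iter_search_pages` and `post_page` are hypothetical names; `post_page` is assumed to be any callable that sends the request body and returns the parsed JSON, which is expected to carry the next cursor under `paging.next.after`.

```python
import time


def iter_search_pages(post_page, body, delay_seconds=0.2, sleep=time.sleep):
    """Yield results from a cursor-paginated search, pausing between pages.

    post_page: hypothetical callable taking the request body (a dict) and
    returning the parsed JSON response as a dict.
    """
    after = None
    while True:
        page_body = dict(body)  # copy so the caller's dict is not mutated
        if after is not None:
            page_body["after"] = after
        data = post_page(page_body)
        yield from data.get("results", [])
        after = data.get("paging", {}).get("next", {}).get("after")
        if after is None:
            break
        sleep(delay_seconds)  # stay under the search rate limit
```

With `requests`, `post_page` could be a small wrapper around `requests.post(...)` that calls `raise_for_status()` and returns `.json()`. Because updating a contact changes which contacts match the filter, it is probably safer to collect the results into a list first (`contacts = list(iter_search_pages(...))`) and only then start writing updates.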
Cost
- Apollo: 1 credit/enrichment. Called for every contact.
- Clearbit: Volume-based, starting at $99/mo. Called only when Apollo has gaps (~30% of contacts).
- People Data Labs: $0.03-0.10/enrichment. Called only when both Apollo and Clearbit have gaps (~10-15%).
- Per 100 contacts (typical): 100 Apollo credits + ~30 Clearbit credits + ~10-15 PDL credits. Roughly $15-25 total at standard pricing.
- HubSpot: Free within rate limits.
- GitHub Actions: Free tier (2,000 min/month).
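The call volumes in the list above follow from simple arithmetic on the fall-through rates. The sketch below reproduces them for 100 contacts; `estimate_calls` is a hypothetical helper, and the rates (30% reaching Clearbit, 10-15% reaching PDL) and the PDL price range are the figures quoted above, not measured values.

```python
def estimate_calls(contacts, clearbit_rate, pdl_rate):
    """Expected number of calls per provider for one run.

    clearbit_rate and pdl_rate are the fractions of contacts that still have
    gaps when each provider's turn comes.
    """
    return {
        "apollo": contacts,  # every contact goes to the first provider
        "clearbit": round(contacts * clearbit_rate),
        "pdl": round(contacts * pdl_rate),
    }


low = estimate_calls(100, 0.30, 0.10)
high = estimate_calls(100, 0.30, 0.15)

# PDL spend range using the $0.03-0.10 per-enrichment figure quoted above.
pdl_cost_low = low["pdl"] * 0.03
pdl_cost_high = high["pdl"] * 0.10

print(low, high)
print(f"PDL: ${pdl_cost_low:.2f} to ${pdl_cost_high:.2f}")
```

Replacing the assumed rates with fill rates measured on your own contacts gives a more reliable estimate than the typical figures.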
Clearbit charges for 404 (not found) responses on some plans. Check your Clearbit plan terms — if 404s cost credits, add a domain pre-check or only call Clearbit for well-known company domains.
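One possible form of the domain pre-check is to skip Clearbit for addresses on free-mail domains, on the assumption that those are the lookups most likely to come back as 404. The sketch below is illustrative only: `should_call_clearbit` and `FREE_MAIL_DOMAINS` are hypothetical names, and the domain list is deliberately short and would need extending for real data.

```python
# Hypothetical, deliberately short list; extend it for your own data.
FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "icloud.com"}


def should_call_clearbit(email, free_domains=FREE_MAIL_DOMAINS):
    """Return True only for addresses on a non-free-mail domain."""
    if not email or "@" not in email:
        return False
    domain = email.rsplit("@", 1)[1].strip().lower()
    return bool(domain) and domain not in free_domains


print(should_call_clearbit("jane@acme.com"))
print(should_call_clearbit("bob@Gmail.com"))
```

A check like this could be called at the top of the Clearbit provider function, returning an empty result when it fails so the waterfall simply moves on to the next provider.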
Next steps
- Add provider ROI tracking — log fill rates per provider to a CSV and review monthly to see if all three are worth paying for
- Weight by ICP — if your ICP is enterprise, try Clearbit first (stronger enterprise coverage). If SMB, try Apollo first.
- Add caching — store enrichment results in a local SQLite database to avoid re-enriching the same email across runs
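As a starting point for the caching idea, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the `enrichment_cache.db` file name, the 30-day freshness window, and the `open_cache` and `cached_enrich` names are all assumptions for illustration, not part of the script above.

```python
import json
import sqlite3
import time


def open_cache(path="enrichment_cache.db"):
    """Open (or create) the cache database. The file name is a placeholder."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enrichment ("
        "email TEXT PRIMARY KEY, fields TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return conn


def cached_enrich(conn, email, enrich_fn, max_age_days=30):
    """Return cached fields for email if still fresh, else call enrich_fn and store."""
    key = email.strip().lower()  # normalize so case differences share one entry
    row = conn.execute(
        "SELECT fields, fetched_at FROM enrichment WHERE email = ?", (key,)
    ).fetchone()
    if row and time.time() - row[1] < max_age_days * 86400:
        return json.loads(row[0])
    fields = enrich_fn(email)
    conn.execute(
        "INSERT OR REPLACE INTO enrichment (email, fields, fetched_at) VALUES (?, ?, ?)",
        (key, json.dumps(fields), time.time()),
    )
    conn.commit()
    return fields
```

In the main loop this would wrap the waterfall call, for example `cached_enrich(conn, email, waterfall_enrich)`. On GitHub Actions the runner's disk does not persist between runs, so the database file would have to be stored somewhere durable for the cache to help.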