Waterfall enrich HubSpot contacts across Apollo, Clearbit, and PDL using code

High complexity | Cost: $0 | Recommended

Prerequisites
  • Node.js 18+ or Python 3.9+
  • HubSpot private app token with crm.objects.contacts.read and crm.objects.contacts.write scopes
  • Apollo API key (Settings → Integrations → API)
  • Clearbit API key (API → API Keys in dashboard)
  • People Data Labs API key (from PDL dashboard)
  • A scheduling environment: cron or GitHub Actions
  • HubSpot contact properties for every field the script writes. HubSpot rejects updates to properties that do not exist, so create any your portal lacks (enrichment_source is a custom property; check linkedin_url as well)

Step 1: Set up the project

# Test each API key
curl -X POST "https://api.apollo.io/api/v1/people/match" \
  -H "x-api-key: $APOLLO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
 
curl "https://person.clearbit.com/v2/people/find?email=test@example.com" \
  -H "Authorization: Bearer $CLEARBIT_API_KEY"
 
curl -X POST "https://api.peopledatalabs.com/v5/person/enrich" \
  -H "x-api-key: $PDL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
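
Before running the enrichment script itself, it can help to fail fast when a key is missing instead of hitting a KeyError halfway through a run. The sketch below assumes the four environment variable names used later in this guide; require_env is a hypothetical helper, not part of any provider SDK.

```python
import os

# Variable names taken from the script and workflow in this guide.
REQUIRED_ENV_VARS = ["HUBSPOT_TOKEN", "APOLLO_API_KEY", "CLEARBIT_API_KEY", "PDL_API_KEY"]


def require_env(environ=None):
    """Return the required settings, or raise if any are unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_ENV_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in REQUIRED_ENV_VARS}
```

Calling require_env() at the top of main() would stop the run with a single readable message listing every missing key.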

Step 2: Build the enrichment providers

Define each provider as a function that accepts an email and returns a normalized result. This makes it easy to reorder or swap providers:

import requests
import os
import time
 
APOLLO_API_KEY = os.environ["APOLLO_API_KEY"]
CLEARBIT_API_KEY = os.environ["CLEARBIT_API_KEY"]
PDL_API_KEY = os.environ["PDL_API_KEY"]
 
def enrich_apollo(email):
    """Apollo People Match: returns normalized fields."""
    resp = requests.post(
        "https://api.apollo.io/api/v1/people/match",
        headers={"x-api-key": APOLLO_API_KEY, "Content-Type": "application/json"},
        json={"email": email},
        timeout=10,  # avoid hanging the whole run on one slow request
    )
    resp.raise_for_status()
    person = resp.json().get("person")
    if not person:
        return {}
    # "organization" may be present but null, so fall back to an empty dict.
    org = person.get("organization") or {}
    return {
        "jobtitle": person.get("title"),
        "company": org.get("name"),
        "phone": (person.get("phone_numbers") or [{}])[0].get("sanitized_number"),
        "linkedin_url": person.get("linkedin_url"),
        "industry": org.get("industry"),
        "seniority": person.get("seniority"),
    }
 
def enrich_clearbit(email):
    """Clearbit Person Enrichment: returns normalized fields."""
    resp = requests.get(
        "https://person.clearbit.com/v2/people/find",
        params={"email": email},  # let requests URL-encode characters such as "+"
        headers={"Authorization": f"Bearer {CLEARBIT_API_KEY}"},
        timeout=10,
    )
    if resp.status_code == 404:
        return {}
    resp.raise_for_status()
    data = resp.json()
    # Nested objects may be null, so fall back to empty dicts.
    employment = data.get("employment") or {}
    handle = (data.get("linkedin") or {}).get("handle")
    # The handle may already carry a path prefix such as "in/"; avoid doubling it.
    if handle and "/" not in handle:
        handle = f"in/{handle}"
    return {
        "jobtitle": employment.get("title"),
        "company": employment.get("name"),
        "seniority": employment.get("seniority"),
        "linkedin_url": f"https://linkedin.com/{handle}" if handle else None,
    }
 
def enrich_pdl(email):
    """People Data Labs Enrichment: returns normalized fields."""
    resp = requests.post(
        "https://api.peopledatalabs.com/v5/person/enrich",
        headers={"x-api-key": PDL_API_KEY, "Content-Type": "application/json"},
        json={"email": email},
        timeout=10,
    )
    if resp.status_code == 404:
        return {}
    resp.raise_for_status()
    body = resp.json()  # parse the response once
    data = body.get("data") or body
    phones = data.get("phone_numbers") or []
    return {
        "jobtitle": data.get("job_title"),
        "company": data.get("job_company_name"),
        "phone": phones[0] if phones else None,
        "linkedin_url": data.get("linkedin_url"),
        "industry": data.get("industry"),
    }

Step 3: Implement the waterfall logic

The core pattern: call the first provider, check which required fields are still empty, and call the next provider only if gaps remain. Each provider is queried with the same email, but its data only fills empty fields; it never overwrites a value set by an earlier provider.

REQUIRED_FIELDS = ["jobtitle", "company", "phone", "linkedin_url", "industry"]
PROVIDERS = [
    ("apollo", enrich_apollo),
    ("clearbit", enrich_clearbit),
    ("pdl", enrich_pdl),
]
 
def waterfall_enrich(email):
    """Call providers in sequence, stopping when all fields are filled."""
    merged = {}
    sources = []
 
    for name, enrich_fn in PROVIDERS:
        missing = [f for f in REQUIRED_FIELDS if not merged.get(f)]
        if not missing:
            break
 
        try:
            result = enrich_fn(email)
            filled_something = False
            for field, value in result.items():
                if value and not merged.get(field):
                    merged[field] = value
                    filled_something = True
            if filled_something:
                sources.append(name)
        except Exception as e:
            print(f"  {name} failed for {email}: {e}")
 
        time.sleep(0.5)  # rate limit buffer between providers
 
    merged["enrichment_source"] = "+".join(sources) if sources else "none"
    return merged

Reorder with one line

The waterfall order is defined by the PROVIDERS list. To try Clearbit first, move its tuple to the front of the list. No other code changes are needed.
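
To see the gap-filling and reordering behavior without spending credits, the merge loop can be exercised with stub providers. This is a minimal sketch: fake_apollo, fake_clearbit, fake_pdl, and merge_in_order are hypothetical stand-ins, the returned values are made up for illustration, and the loop mirrors waterfall_enrich minus the sleep and error handling.

```python
REQUIRED_FIELDS = ["jobtitle", "company", "phone", "linkedin_url", "industry"]


# Stub providers with invented data (no network calls).
def fake_apollo(email):
    return {"jobtitle": "VP Sales", "company": "Acme", "phone": None}


def fake_clearbit(email):
    return {"jobtitle": "Vice President, Sales",
            "linkedin_url": "https://linkedin.com/in/example"}


def fake_pdl(email):
    return {"phone": "+15555550100", "industry": "software"}


def merge_in_order(email, providers):
    """Fill empty fields provider by provider; never overwrite a set value."""
    merged, sources = {}, []
    for name, enrich_fn in providers:
        if all(merged.get(f) for f in REQUIRED_FIELDS):
            break  # nothing left to fill, skip remaining providers
        filled = False
        for field, value in enrich_fn(email).items():
            if value and not merged.get(field):
                merged[field] = value
                filled = True
        if filled:
            sources.append(name)
    merged["enrichment_source"] = "+".join(sources) if sources else "none"
    return merged


apollo_first = merge_in_order("a@example.com", [
    ("apollo", fake_apollo), ("clearbit", fake_clearbit), ("pdl", fake_pdl)])
clearbit_first = merge_in_order("a@example.com", [
    ("clearbit", fake_clearbit), ("apollo", fake_apollo), ("pdl", fake_pdl)])

print(apollo_first["jobtitle"], "|", apollo_first["enrichment_source"])
# VP Sales | apollo+clearbit+pdl
print(clearbit_first["jobtitle"], "|", clearbit_first["enrichment_source"])
# Vice President, Sales | clearbit+apollo+pdl
```

The job title comes from whichever provider runs first, which is why provider order matters for fields where the sources disagree.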

Step 4: Fetch contacts and write results to HubSpot

HUBSPOT_TOKEN = os.environ["HUBSPOT_TOKEN"]
HS_HEADERS = {"Authorization": f"Bearer {HUBSPOT_TOKEN}", "Content-Type": "application/json"}
 
def get_unenriched_contacts(limit=50):
    """Find contacts missing job title."""
    resp = requests.post(
        "https://api.hubapi.com/crm/v3/objects/contacts/search",
        headers=HS_HEADERS,
        json={
            "filterGroups": [{"filters": [{
                "propertyName": "jobtitle",
                "operator": "NOT_HAS_PROPERTY"
            }]}],
            "properties": ["email", "jobtitle", "company", "phone"],
            "limit": limit
        }
    )
    resp.raise_for_status()
    return resp.json()["results"]
 
def update_contact(contact_id, fields):
    """Write non-null fields to HubSpot."""
    properties = {k: v for k, v in fields.items() if v}
    if not properties:
        return
    resp = requests.patch(
        f"https://api.hubapi.com/crm/v3/objects/contacts/{contact_id}",
        headers=HS_HEADERS,
        json={"properties": properties}
    )
    resp.raise_for_status()
 
def main():
    contacts = get_unenriched_contacts()
    print(f"Found {len(contacts)} contacts to enrich")
 
    for contact in contacts:
        email = contact["properties"].get("email")
        if not email:
            continue
 
        print(f"Enriching {email}...")
        fields = waterfall_enrich(email)
        update_contact(contact["id"], fields)
        print(f"  Source: {fields.get('enrichment_source')} | Fields: {[k for k,v in fields.items() if v and k != 'enrichment_source']}")
 
    print("Done.")
 
if __name__ == "__main__":
    main()
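
get_unenriched_contacts returns only the first page of search results. If more contacts match than fit in one page, the search endpoint's paging cursor can be followed. The sketch below assumes the response carries the cursor at paging.next.after and that it is sent back as "after" in the request body. iter_search_results and post_json are hypothetical names, and post_json stands in for whatever function posts the body to the search URL and returns the parsed JSON.

```python
import time


def iter_search_results(post_json, body, page_delay=0.2, max_pages=20):
    """Yield results across pages.

    post_json(body) must return the parsed JSON response as a dict.
    max_pages is a safety cap chosen for this sketch.
    """
    body = dict(body)  # copy so the caller's dict is not mutated
    for _ in range(max_pages):
        data = post_json(body)
        yield from data.get("results", [])
        after = data.get("paging", {}).get("next", {}).get("after")
        if not after:
            return  # no further pages
        body["after"] = after
        time.sleep(page_delay)  # pause between pages to respect the search limit
```

Because each update removes a contact from the "missing job title" filter, it is probably safer to collect all pages into a list first (list(iter_search_results(...))) and only then start writing updates.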

Step 5: Schedule the script

# .github/workflows/waterfall-enrich.yml
name: Waterfall Enrichment
on:
  schedule:
    - cron: '0 */4 * * *'  # Every 4 hours
  workflow_dispatch: {}
jobs:
  enrich:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install requests
      - run: python enrich.py
        env:
          HUBSPOT_TOKEN: ${{ secrets.HUBSPOT_TOKEN }}
          APOLLO_API_KEY: ${{ secrets.APOLLO_API_KEY }}
          CLEARBIT_API_KEY: ${{ secrets.CLEARBIT_API_KEY }}
          PDL_API_KEY: ${{ secrets.PDL_API_KEY }}

Rate limits

API                    Limit                Delay
Apollo People Match    5 req/sec (Basic)    500ms between calls
Clearbit Person        600 req/min          Unlikely to hit
People Data Labs       10 req/sec           500ms between calls
HubSpot Search         5 req/sec            200ms between pages
HubSpot PATCH          150 req/10 sec       No delay needed
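
Fixed delays keep a single run under these limits, but a retry with exponential backoff is a common safety net for the occasional 429 response. The following is a sketch under assumptions: RateLimited and call_with_backoff are hypothetical names, and a provider function would need to raise RateLimited itself when it sees resp.status_code == 429 (optionally passing the Retry-After header value, if the API sends one).

```python
import time


class RateLimited(Exception):
    """Raised by a provider function when the API answers with HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds to wait, if the API said so


def call_with_backoff(fn, *args, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(*args), retrying on RateLimited with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt  # 0.5s, 1s, 2s, ...
            sleep(delay)
```

Inside the waterfall, result = call_with_backoff(enrich_fn, email) could replace the direct enrich_fn(email) call; a provider that stays rate limited would still fall through to the existing except branch.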

Cost

  • Apollo: 1 credit/enrichment. Called for every contact.
  • Clearbit: Volume-based, starting at $99/mo. Called only when Apollo has gaps (~30% of contacts).
  • People Data Labs: $0.03-0.10/enrichment. Called only when both Apollo and Clearbit have gaps (~10-15%).
  • Per 100 contacts (typical): 100 Apollo credits + ~30 Clearbit credits + ~10-15 PDL credits. Roughly $15-25 total at standard pricing.
  • HubSpot: Free within rate limits.
  • GitHub Actions: Free tier (2,000 min/month).
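
The per-100-contacts estimate above can be turned into a small calculation for other batch sizes. This sketch uses the fallthrough rates quoted in the list (about 30% reach Clearbit, 10-15% reach PDL) as default assumptions and the PDL price range of $0.03-0.10 per enrichment; expected_calls and cost_range are hypothetical helpers, and your own rates will differ.

```python
def expected_calls(contacts, clearbit_rate=0.30, pdl_rate=0.15):
    """Estimate API calls per provider for a batch of contacts.

    Rates are the share of contacts that fall through to each provider;
    the defaults are the upper-end figures from the cost list.
    """
    return {
        "apollo": contacts,  # Apollo is called for every contact
        "clearbit": round(contacts * clearbit_rate),
        "pdl": round(contacts * pdl_rate),
    }


def cost_range(calls, low_price, high_price):
    """Return (low, high) cost in dollars for a per-call price range."""
    return (round(calls * low_price, 2), round(calls * high_price, 2))


calls = expected_calls(100)
print(calls)  # {'apollo': 100, 'clearbit': 30, 'pdl': 15}
print(cost_range(calls["pdl"], 0.03, 0.10))  # (0.45, 1.5)
```

At these rates the PDL portion of a 100-contact batch is about $0.45 to $1.50; the Apollo and Clearbit portions depend on your plan's effective price per credit.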

Clearbit 404s cost credits

Clearbit charges for 404 (not found) responses on some plans. Check your Clearbit plan terms. If 404s cost credits, add a domain pre-check or call Clearbit only for well-known company domains.
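
One simple form of domain pre-check is to skip addresses on personal mailbox domains before calling Clearbit. This is a sketch resting on the assumption that such addresses are less likely to return a match; should_call_clearbit is a hypothetical helper and the domain list is a short illustrative sample, not a complete one.

```python
# Illustrative sample of personal mailbox domains; extend for your own data.
FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "icloud.com"}


def should_call_clearbit(email, skip_domains=FREE_MAIL_DOMAINS):
    """Return True if the email looks like a company address worth a lookup."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local:
        return False  # not a usable email address
    domain = domain.lower()
    return "." in domain and domain not in skip_domains
```

In enrich_clearbit, returning {} early when should_call_clearbit(email) is False would let the waterfall move on to the next provider without spending a Clearbit request.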

Next steps

  • Add provider ROI tracking — log fill rates per provider to a CSV and review monthly to see if all three are worth paying for
  • Weight by ICP — if your ICP is enterprise, try Clearbit first (stronger enterprise coverage). If SMB, try Apollo first.
  • Add caching — store enrichment results in a local SQLite database to avoid re-enriching the same email across runs
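
The caching idea can be sketched with the standard-library sqlite3 module. The table layout, the 30-day expiry, and the names open_cache and cached_enrich are all assumptions for illustration; enrich_fn would be waterfall_enrich or any single provider function.

```python
import json
import sqlite3
import time


def open_cache(path="enrichment_cache.db"):
    """Open (or create) the cache database and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enrichment ("
        "email TEXT PRIMARY KEY, fields TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    conn.commit()
    return conn


def cached_enrich(conn, email, enrich_fn, max_age_days=30, now=None):
    """Return cached fields for email if fresh, else call enrich_fn and store."""
    now = time.time() if now is None else now
    key = email.strip().lower()  # normalize so case differences share one entry
    row = conn.execute(
        "SELECT fields, fetched_at FROM enrichment WHERE email = ?", (key,)
    ).fetchone()
    if row and now - row[1] < max_age_days * 86400:
        return json.loads(row[0])  # cache hit, no API calls made
    fields = enrich_fn(email)
    conn.execute(
        "INSERT OR REPLACE INTO enrichment (email, fields, fetched_at) VALUES (?, ?, ?)",
        (key, json.dumps(fields), now),
    )
    conn.commit()
    return fields
```

On GitHub Actions the runner's filesystem is discarded after each job, so the database file would need to be persisted between runs (for example with a cache or artifact step) for this to help.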
