
Overview

The GitHub Webset provides enriched profile data for every active GitHub user, combining raw GitHub data with additional fields we’ve linked or normalized: full name parsing, personal emails, and LinkedIn profile mappings.
Dataset Size: ~58M+ GitHub users
Refresh Rate: Monthly
LinkedIn Mappings: ~5M profiles

What’s Included

The GitHub Webset enriches standard GitHub profile data with:

Personal Information

  • Name parsing - Full name, first, middle, and last name extracted from various sources
  • Personal emails - Email addresses beyond what’s publicly visible on GitHub
  • Location data - Structured location information (city, state, country, continent)

Professional Data

  • LinkedIn mapping - Connected LinkedIn profiles for ~5M GitHub users
  • Work information - Current company, position, and work history (when available)
  • Contact details - Work emails, school emails, and other contact methods

GitHub Activity

  • Repositories - All public repos with metadata (stars, forks, topics, languages)
  • Commits - Recent commit activity across repositories
  • Stars - Repositories starred by the user
  • Issues - Issues created or commented on
  • Social graph - Followers and following accounts

Quick Start Example

Here’s a minimal example to get started:
```bash
curl -X POST "https://api.peoplecontext.com/v1/person/enrich" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"github": "torvalds", "websets": ["github"]}'
```
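
The same request can be sketched in Python using only the standard library. The endpoint, headers, and payload are taken from the curl command above; the function names `build_enrich_request` and `enrich`, and the 30-second timeout, are illustrative choices rather than part of any official client.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
API_URL = "https://api.peoplecontext.com/v1/person/enrich"


def build_enrich_request(github_username: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the enrichment request for one GitHub username."""
    payload = {"github": github_username, "websets": ["github"]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def enrich(github_username: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body.

    urllib raises urllib.error.HTTPError for non-2xx responses.
    """
    request = build_enrich_request(github_username, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `enrich("torvalds", "YOUR_API_KEY")` would send the request. Keeping request construction separate from sending makes the payload easy to inspect or test without network access.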

Response Structure

Here’s what a typical response looks like (trimmed for clarity):
```json
{
  "person": {
    "github": {
      "github_username": "torvalds",
      "full_name": "Linus Torvalds",
      "names": ["Linus Torvalds"],
      "emails": ["[email protected]"],
      "bio": "Creator of Linux and Git",
      "location": "Portland, OR",
      "company": "@linuxfoundation",
      "followers": 180000,
      "following": 0,
      "public_repos": 6,
      "repos": [
        {
          "full_name": "torvalds/linux",
          "language": "C",
          "stargazers_count": 150000,
          "description": "Linux kernel source tree"
        }
      ]
    }
  },
  "websets_matched": ["github"]
}
```

For the complete schema, see the API Reference.
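
A minimal sketch of reading such a response defensively in Python. It assumes that `websets_matched` omits `"github"` when no GitHub profile was found, which the sample response suggests but does not confirm. The helper names `github_profile` and `most_starred_repo` are hypothetical.

```python
from typing import Any, Optional


def github_profile(response: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the person.github object, or None if the GitHub webset did not match.

    Assumption: an unmatched lookup leaves "github" out of websets_matched.
    """
    if "github" not in response.get("websets_matched", []):
        return None
    return response.get("person", {}).get("github")


def most_starred_repo(profile: dict[str, Any]) -> Optional[str]:
    """Return the full_name of the repo with the most stars, if any repos are present."""
    repos = profile.get("repos") or []
    if not repos:
        return None
    # stargazers_count may be absent on some repos, so treat a missing value as 0.
    best = max(repos, key=lambda repo: repo.get("stargazers_count") or 0)
    return best.get("full_name")


# Trimmed copy of the sample response shown above.
sample = {
    "person": {
        "github": {
            "github_username": "torvalds",
            "followers": 180000,
            "repos": [{"full_name": "torvalds/linux", "stargazers_count": 150000}],
        }
    },
    "websets_matched": ["github"],
}

profile = github_profile(sample)
if profile is not None:
    print(profile["github_username"], most_starred_repo(profile))
```

Using `.get()` with fallbacks rather than direct indexing avoids `KeyError` on profiles where optional fields were not populated.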

Data Schema

The GitHub Webset returns enriched profile data with the following structure:
```json
{
  "user_id": "1024025",
  "github_username": "torvalds",
  "github_url": "https://github.com/torvalds",
  "full_name": "Linus Torvalds",
  "first_name": "Linus",
  "last_name": "Torvalds",
  "names": ["Linus Torvalds"],
  "emails": ["[email protected]"],
  "email": "[email protected]",
  "bio": "Creator of Linux and Git",
  "location": "Portland, OR",
  "location_canonical": {
    "city": "Portland",
    "state": "Oregon",
    "country": "United States",
    "country_code": "US",
    "latitude": 45.5152,
    "longitude": -122.6784
  },
  "company": "@linuxfoundation",
  "linkedin_url": "https://www.linkedin.com/in/linustorvalds",
  "linkedin_username": "linustorvalds",
  "github": {
    "followers": 180000,
    "following": 0,
    "public_repos": 6,
    "created_at": "2011-09-03T15:26:22Z",
    "updated_at": "2024-01-15T10:30:00Z"
  },
  "summary": "Open source developer and maintainer of Linux kernel",
  "commit_count": 25000,
  "unique_orgs_count": 15,
  "repo_languages": ["C", "Assembly", "Shell"],
  "repos": [
    {
      "repo_id": "2325298",
      "full_name": "torvalds/linux",
      "name": "linux",
      "description": "Linux kernel source tree",
      "language": "C",
      "stargazers_count": 150000,
      "forks_count": 50000,
      "topics": ["linux", "kernel", "operating-system"]
    }
  ],
  "commits": [
    {
      "sha": "abc123...",
      "author_email": "[email protected]",
      "author_date": "2024-01-10T12:00:00Z",
      "repo_full_name": "torvalds/linux",
      "message": "Fix memory leak in driver"
    }
  ],
  "stars": [
    {
      "repo_full_name": "git/git",
      "starred_at": "2023-06-15T08:30:00Z"
    }
  ],
  "follower_accounts": ["github_user_1", "github_user_2"],
  "following_accounts": []
}
```
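
One way to work with this structure in Python is to map a subset of it onto dataclasses, as sketched below. The class names `Location` and `GitHubProfile` and the `from_record` constructor are hypothetical and not part of any official SDK. Only `user_id`, `github_username`, and `github_url` are treated as required; everything else is treated as optional here.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class Location:
    """Mirrors the location_canonical object; every field may be missing."""
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None
    country_code: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None


@dataclass
class GitHubProfile:
    """A subset of the schema above; only the three identifiers are required."""
    user_id: str
    github_username: str
    github_url: str
    full_name: Optional[str] = None
    emails: list[str] = field(default_factory=list)
    linkedin_url: Optional[str] = None
    location: Optional[Location] = None
    followers: Optional[int] = None
    repo_languages: list[str] = field(default_factory=list)

    @classmethod
    def from_record(cls, record: dict[str, Any]) -> "GitHubProfile":
        # Ignore any location keys this sketch does not model.
        known = {f.name for f in fields(Location)}
        raw_location = record.get("location_canonical") or {}
        # Follower counts live under the nested "github" object in the schema.
        stats = record.get("github") or {}
        return cls(
            user_id=record["user_id"],
            github_username=record["github_username"],
            github_url=record["github_url"],
            full_name=record.get("full_name"),
            emails=list(record.get("emails") or []),
            linkedin_url=record.get("linkedin_url"),
            location=Location(**{k: v for k, v in raw_location.items() if k in known})
            if raw_location else None,
            followers=stats.get("followers"),
            repo_languages=list(record.get("repo_languages") or []),
        )
```

Passing a decoded record to `GitHubProfile.from_record(record)` yields an object whose optional attributes are `None` or empty lists when the source field was absent.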

Data Coverage

Field availability varies across profiles. Here are the fill rates for key fields across our GitHub dataset:

Core Identity

| Field | Coverage |
| --- | --- |
| user_id | 100% |
| github_username | 100% |
| summary | 100% |
| commit_count | 100% |
| github_url | 100% |
| github.followers | 80.5% |
| github.following | 80.5% |
| github.public_repos | 80.5% |
| github.created_at | 80.5% |
| github.updated_at | 80.5% |

Activity & Repositories

| Field | Coverage |
| --- | --- |
| unique_orgs_count | 74.1% |
| repos | 73.3% |
| repos[].repo_id | 73.3% |
| repos[].full_name | 73.3% |
| repos[].name | 73.3% |
| commits | 53.6% |
| commits[].sha | 53.6% |
| commits[].author_email | 53.6% |
| repo_languages | 46.2% |
| repos[].stargazers_count | 36.6% |
| repos[].language | 28.2% |

Contact Information

| Field | Coverage |
| --- | --- |
| names | 53.0% |
| emails | 47.5% |
| email | 46.5% |
| full_name | 24.3% |
| first_name | 24.3% |
| last_name | 24.3% |

Social & Network

| Field | Coverage |
| --- | --- |
| stars | 29.4% |
| follower_accounts | 12.3% |
| following_accounts | 12.3% |
| location | 10.5% |
| bio | 9.5% |
| linkedin_username | 8.4% |
| linkedin_url | 8.4% |

Location Details

| Field | Coverage |
| --- | --- |
| location_canonical.city | 7.0% |
| location_canonical.latitude | 7.0% |
| location_canonical.longitude | 7.0% |
| location_canonical.country | 7.0% |
| location_canonical.country_code | 7.0% |

Coverage varies by profile type: Active developers with public contributions tend to have higher field coverage, especially for commits, repos, and activity data.
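
As a rough worked example, fill rates can be turned into approximate profile counts and used to choose fallbacks. The sketch below assumes the "~58M+" dataset size from the overview as a round 58,000,000, so the results are estimates only. The helper names `estimated_profiles` and `display_name` are hypothetical.

```python
# Assumption: "~58M+" from the overview, rounded to a flat 58 million.
DATASET_SIZE = 58_000_000

# Fill rates copied from the coverage tables above, as fractions.
FILL_RATES = {
    "emails": 0.475,
    "full_name": 0.243,
    "linkedin_url": 0.084,
    "location_canonical.city": 0.070,
}


def estimated_profiles(field_name: str, dataset_size: int = DATASET_SIZE) -> int:
    """Approximate number of profiles that have the given field populated."""
    return round(FILL_RATES[field_name] * dataset_size)


def display_name(record: dict) -> str:
    """Pick the best available name, falling back toward always-present fields.

    full_name (24.3%) -> first entry of names (53.0%) -> github_username (100%).
    """
    names = record.get("names") or []
    return record.get("full_name") or (names[0] if names else None) or record["github_username"]


for name in FILL_RATES:
    print(f"{name}: ~{estimated_profiles(name):,} profiles")
```

Under this assumption, `linkedin_url` at 8.4% works out to about 4.87 million profiles, in line with the ~5M LinkedIn mappings quoted in the overview.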