Data

Over the years, I’ve had the opportunity to work with a range of anonymized datasets that offer deep insight into consumer behavior, telecommunications usage, credit dynamics, and personal finance patterns. These datasets are real-world, large-scale, and well-suited for empirical research across several disciplines.

Examples include:

Behavioral telco data covering mobile usage patterns
Credit score and payment history insights across diverse demographics
Household-level financial and transactional records

I’m always open to thoughtful, research-driven collaborations—especially with those who have a strong background in data science, economics, or quantitative social sciences. If you’re exploring real-world applications of your work and enjoy working with complex, meaningful data, I’d love to hear from you.

Academic researchers, especially those based in institutions with active research programs in these areas, are particularly encouraged to reach out.

Let’s explore what’s possible—feel free to get in touch.

Reproducible Data Management Standards

Rules of thumb when naming files:

Human readable: include information about the content
- Numeric values always go first
- Use the ISO 8601 standard (yyyy-mm-dd) for dates (though I would not recommend putting dates in file names in the first place)
- Never delete leading zeros.
Compatible with the default ordering of every system
Avoid spaces and uppercase letters.
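
The naming rules above can be sketched as a small helper. This is only an illustration under assumptions: the function name `make_filename`, the underscore/hyphen separators, and the default two-digit padding are hypothetical choices, not part of any standard.

```python
import re
from datetime import date
from typing import Optional


def make_filename(index: int, description: str, ext: str,
                  when: Optional[date] = None, width: int = 2) -> str:
    """Build a lowercase, space-free file name that sorts correctly by default.

    All names and separators here are illustrative assumptions.
    """
    # Numeric value goes first, zero-padded so "02" sorts before "10".
    prefix = f"{index:0{width}d}"
    # Lowercase, and collapse every run of non-alphanumerics (spaces included) into a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    parts = [prefix]
    if when is not None:
        # ISO 8601 date (yyyy-mm-dd), only if a date is really needed.
        parts.append(when.isoformat())
    parts.append(slug)
    return "_".join(parts) + "." + ext.lower().lstrip(".")


names = [
    make_filename(10, "Fit Model", "py"),
    make_filename(2, "Clean Survey Data", "csv"),
    make_filename(3, "Raw Export", ".CSV", when=date(2021, 3, 5)),
]
# Plain lexicographic sorting now matches the intended numeric order.
print(sorted(names))
```

Because the numeric prefix keeps its leading zero, any file browser's default alphabetical ordering lists `02_...` before `10_...`.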

For data management, document the dataset in datasheet.md or the README.md file, following Datasheets for Datasets (Gebru et al., 2021). The questions below have been modified and simplified; see the paper for the original questions with examples:

Motivation
- Why was the dataset created?
- Who created the dataset?
- Who funded the project?
Composition
- What do the observations represent?
- Is the dataset the population or a small sample (random or nonrandom)?
- Describe missing data (e.g., type and missing mechanism)
- Describe data splits (e.g., training, testing, etc.)
- Does the dataset contain sensitive or confidential information?
  
  For datasets related to people:
- Can data users identify sub-population or individuals from the dataset?
Collection process
- How was the data collected (directly observed or indirectly inferred)?
- What procedures were used to collect data?
- What was the sampling strategy (e.g., probabilistic)?
- What time frame does the data cover?
- Were any ethical review processes conducted? (e.g., institutional review board?)
  
  For datasets related to people:
- Did you collect data from respondents or via third parties?
- Did you obtain respondents’ consent?
- Is there any mechanism for respondents to revoke their consent?
- Did you conduct a data protection impact analysis?
Preprocessing/cleaning/labeling
- Describe these processes
- Is the raw dataset still available upon request?
- What software was used to process the data? Provide a script if possible.
Uses
- Provide exemplary uses of the dataset (a list of published papers using the dataset is encouraged)
- Is there anything about the dataset (e.g., composition, collection, processing) that can impact future uses?
- In which cases should the dataset not be used?
Distribution
- How will the dataset be distributed? (e.g., API, GitHub, data repo)
- When will the dataset be available?
- Does the dataset involve copyright, intellectual property (IP) license, or terms of use?
Maintenance
- Who supports/hosts/maintains the dataset?
- How can the data maintainer be contacted?
- Are there any errors that data users need to know about?
- Will the dataset be updated?
- Will older versions of the dataset continue to be supported?
- How can others extend/build on/contribute to the dataset?
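
To make the checklist above easier to adopt, the following is a minimal sketch that renders an empty datasheet.md skeleton with one section per heading. The function names, the Markdown layout, and the `TODO` placeholders are assumptions made for illustration, and only a few questions per section are included here; fill in the rest from the list above or from Gebru et al. (2021).

```python
from pathlib import Path

# Abbreviated question list (an illustrative subset of the checklist above).
SECTIONS = {
    "Motivation": ["Why was the dataset created?", "Who created the dataset?"],
    "Composition": ["What do the observations represent?", "Describe missing data."],
    "Collection process": ["How was the data collected?", "What was the sampling strategy?"],
    "Preprocessing/cleaning/labeling": ["Describe these processes.", "Is the raw dataset still available?"],
    "Uses": ["Provide exemplary uses of the dataset.", "When should the dataset not be used?"],
    "Distribution": ["How will the dataset be distributed?", "Are there license or terms-of-use restrictions?"],
    "Maintenance": ["Who maintains the dataset?", "Will the dataset be updated?"],
}


def render_datasheet(title: str, sections: dict = SECTIONS) -> str:
    """Return a Markdown skeleton with one heading per datasheet section."""
    lines = [f"# Datasheet: {title}", ""]
    for heading, questions in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for question in questions:
            # Each question becomes a bullet with a placeholder answer.
            lines.append(f"- **{question}** TODO")
        lines.append("")
    return "\n".join(lines)


def write_datasheet(title: str, path: str = "datasheet.md") -> None:
    """Write the skeleton to disk (hypothetical default path)."""
    Path(path).write_text(render_datasheet(title), encoding="utf-8")


skeleton = render_datasheet("example survey")
print(skeleton.splitlines()[0])
```

Calling `write_datasheet("example survey")` would place the skeleton next to the data, where the `TODO` markers can be replaced with real answers.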

Management

MSCI (formerly GMI Ratings)
- Companies
  - Corporate Ownership
- Directors:
  - Corporate Board structure
  - Independent Director Positions
  - Committee Assignments
  - Director Compensation
- CEO Compensation
- Takeover defenses
World Management Survey
Harvard’s “Creating Emerging Markets”
Workplace discrimination
CoreSignal - Employee Review: employee reviews
CoreSignal - Job Posting
CoreSignal - Employee data
CSRHub by Consensus ESG ratings: Data on CSR around the world (paid)
CEO Dismissal: by Gentry et al. (1992-2018)
- A database of CEO turnover and dismissal in S&P 1500 firms, 2000–2018 (SMJ, 2021)
- R code

Innovation

Patent Data

See this post for more details
Global Entrepreneurship Monitor
Firm and Industry Evolution, Entrepreneurship, and Strategy
NBER patent
Matching USPTO Patent Assignees to Compustat Public Firms and SDC Private Firms
Google Patents Public Datasets on BigQuery: worldwide + USPTO full-text
USPTO PatentsView: US only (with raw data)
PATSTAT: European patent data from the EPO
DISCERN - Duke Innovation & Scientific Enterprises Research Network: (Arora et al., 2021) strongly recommended for matching patents to Compustat
Extended Data (till 2020) following Kogan et al. (2017)
Lens: cost to search
WIPO: with guide for coding
UVA Darden Global Corporate Patent Dataset
PatCit: A Comprehensive Dataset of Patent Citations
CoreSignal - Repo: software projects, experienced developers, and data analysts.
KickStarter
KickStarter data from Kaggle
Indiegogo
AI Patent Dataset: Identifying AI invention

Technology Adoption

The CHAT (Cross-country Historical Adoption of Technology) Dataset: download

Academic Literature

Microsoft Academic Knowledge Graph
OpenAlex
PubMed
Web of Science
Scopus
Reliance on Science in Patenting: front-page and in-text citations from patents to scientific articles through 2020

Demographic Data

Politics

Realtime NOMINATE Ideology
State Ideology data: only to 2017

Geography

County Business Patterns (Geo data)
Google Open Buildings: A dataset of building footprints
- Leafmap (Python code)
Foursquare OS Places
Overture Maps Data

Finance

Compustat
- Execucomp
- Capital IQ - Key Developments: event types for companies
CRSP
Worldscope: fundamental data for major international firms. Searchable by company name, country, exchange, or fundamental items. 1980+. (via WRDS)
List of private firms
- Crunchbase: descriptions of start-ups, companies, and people
- Dun & Bradstreet: company info
- Privco: financial and market intelligence on private firms (USC has access)
- Corporate Registration: search info on corporation by state
- SageWorks
- Thomasnet: for suppliers and buyers
- PitchBook
World Bank
IMF
Mergent Online
Orbis: global firms, including private ones (survivorship-biased: companies are dropped after 10 years if not active)
SDC Platinum: historical transactions
States of Incorporation
Zillow Home Prices: Zillow’s API allows business access only (not for educational or research purposes), but small datasets can be downloaded
Redfin: full data set, definition
National Association of Realtors: Citation
10-K Text data by Hoberg-Phillips
Thomson Reuters
- Eikon
Pitchbook: private and public data
OANDA: currency-related data
Professor French database
Open Data Network
World Bank
Data Current
Trading Economics: paid API service
AQR Dataset
CoreSignal - Company funding
CoreSignal Firmographic data
Failory: Data on failed businesses
HUD: American Housing survey, housing data, public housing population
XBRL Research: measure firm and accounting complexity
International Fundamentals:
- China: CSMAR (via WRDS)
- EU: Amadeus
- India: Prowess
- Canada: SEDAR
- Japan: EDINET
Economic Policy Uncertainty
DataCore: Data provider in Vietnam

Taxes

Housing

Inside Airbnb: Free
airdna: paid

Factors

Risk

Crypto Currency

CryptoScamDB: Report about scams (only name of blacklisted domains)
BitcoinAbuse: reports about scams with date, abuser, description, from country, and created_at.

Textual Network Data in Finance

Bankruptcy

Florida-UCLA-LoPucki Bankruptcy Research Database (BRD): until December 2022

Heir

heirbase

Economics

Journal of Applied Econometrics
DICE - Database for Institutional Comparisons of Economies
UK Data Service
World Bank
Bureau of Economic Analysis: GDP, personal Income, International Trade, and Transactions.
International Monetary Fund (IMF) Data
Consumer Expenditure Surveys
U.S. Bureau of Labor Statistics
- Details
Kiva: microloan data (download snapshots; see the data summary)
- Preprocessed data (Choo et al., 2014)
Unemployment rate: FRED Economic Data St. Louis FED
EIA: U.S. Energy Information Administration (also has API)
International Financial Reforms: 91 economies over 1973–2005

Pay Records/ Salary

Plaid
Payscale
- Developer

Marketing

Data Breach

privacyrights.org
DLDOS: Data Loss Database - Open Source (2000-2008)
Darknet market archives
AI Incident Database: https://incidentdatabase.ai/

App

apptweak or apptweak.io (paid): app download, revenue, rating
apptopia (paid)
sensortower (paid)
appfollow (paid)
Apple Search Ads service: Apple Search Popularity Score

Social Media

Instaloader: download content from Instagram
instagrapi: download and push data on Instagram

Product Introduction

Mintel
IRI

Sentiment

DataStreamer: Search API returns search results from multiple sources (Twitter, Instagram, Blogs, Forums, News, International News).
Bloomberg: sentiments based on news articles and Twitter
Meltwater: paid
Infegy: paid

Firm

Fame: data on companies in the UK and Ireland.
Wharton Customer Analytics: write a proposal to partner with firms to get data

Branding

EquiTrend The Harris Poll 1; 2:
- three factors – Familiarity, Quality and Purchase Consideration
- 45,000 US consumers assessed nearly 2,000 brands across 196 categories.
- 91 companies were awarded the coveted Brand of the Year designation across 87 categories
World Brand Lab: brand equity ranking in China.
Brand24 - Media Monitoring Tool: monitor brand from Twitter, Facebook, and Instagram.
Affectiva - Humanizing Technology: data on emotional reactions to 53k ads across 90 countries and 8 years.
YouGov: daily data on brands (see Colicev et al., 2018, when using this dataset).

Advertising

WARC
- Adspend
- Media Owner Profile
Winmo: Adspend per firm/brand
AdAge: data on ad spending
IAB Advertising Spend and Revenue Research
Facebook Ads API: Meta Ad Library API
Kantar Media
- Vivvix: combines Kantar and Numerator into an independent firm (IU, Penn State, and Emory subscribe)
Ad-targeting labels
eMarketer (Insider Intelligence): total media ad spending by media (digital, newspaper, magazine, radio, mobile, TV, out-of-home) and digital format (display, video, rich media, classified, paid search, radio, podcast, social media)

Datareportal: on online activities

Others

American Customer Satisfaction Index (ACSI)
Marketing Science Data Resource
Instacart
- Instacart order: data
CMO Spend Survey 2018–2019
GWI Global Consumer data (students might get free access)
Global Market Database: global market data
Safeguard Data: Foot traffic, mobile data, transaction data.
comScore Media Metrix: Online Traffic (direct traffic, search engine referrals, transaction counts)
Global Market Information Database by Euromonitor International (good for studying market penetration).
IRI Marketing Data Set: panel scanner data for academic research, covering 30 product categories over 5 years and including sales, pricing, and promotion data.
AiMark Scanner Data
CoreSignal Tech Product Review + Technographic data
Better Marketing for a Better World
GWI: Paid + Free data on marketing strategy globally.
Data.ai: Paid + Free data on app store tracking, reviews, usage, optimization, paid search, revenue estimates, Game IQ, advertising estimate
Similarweb: Paid + Free data on website traffic, app analysis
Box Office Mojo: data on movie box office performance
The Numbers: data on movies
Metacritic: Data on movie reviews
camelcamelcamel: Amazon price tracker
Google Trends dataset available via BigQuery
Webscrape datasets
Ward Intelligence: Car data
J. D. Power: car data (paid)
Data Provider: industry data (paid) with API
Datarade: Data marketplace
Food Facts
FDA: api
Parking Reforms: map
AI Harm Incidents: previously
Data Axle: Consumer Data
decadata: consumer data
SimplyAnalytics: demographic, historic census, business, health, real estate, housing, employment, consumer spending, and marketing

Review

TripAdvisor Review: using Octoparse
- Small dataset of TripAdvisor
Epinions and Ciao
Yelp Review
Julian McAuley’s lab data: Recommender System and personalization dataset
Trust Pilot: Consumer review
Goodreads dataset: book reviews. More
MovieLens: Movie recommendation
Ratebeer: Beer review. Sign up for API. Or use SNAP data

Search Engine Data

Web Traffic

Alexa Web Information Service: retired as of Dec 8, 2021
ahrefs: Min $99
semrush: Min $119
kissmetrics: negotiable price
authoritas: $99
watchthem: $29
Cisco Umbrella: free (relative ranking data, not actual browsing activity)
rankwatch: $29 (keyword search)
SE Ranking: free trial - $32/month (url monitor)
Amazon QuickSight: fancy keyword research for enterprises
tranco: free for researchers (aggregate ranking lists)
Cloudflare Radar: Internet usage
comscore: paid enterprise for social and web traffic
moz: $99 web traffic
similarweb: enterprise solution $199 (cannot use API or request customized dataset unless you are using the ultimate package (>$20k))
infomer: check website safety
majestic: website traffic $49, but $399 for API
SE Ranking: SEO tools and dashboard API is $192
serpstat: domain, keyword, URL, backlink analysis (recommended, API is available for every plan, as low as $55)

Sales

Kilts Center for Marketing: Chicago Booth
- Subscription Dataset:
  - Nielsen Dataset
- Public Datasets:
  - Dominick’s Dataset: 1989 - 1994, store level data, shelf management and pricing
  - ERIM Dataset: households, TV viewing data is available to measure exposure to commercials involving the products
  - BAYESM: more of a software package, but also includes some panel data.
International Data Corporation: Market share

Complaints

Consumer Financial Protection Bureau
Complaints Board: can search for businesses
Federal Trade Commission
Ripoff Report: Complaints, Reviews, Scams, Lawsuits, Frauds

Health

NHS Digital
HealthData
WHO
AWS Registry
Medicare and Medicaid
Child Health and Development Studies
National Vital Statistics System
National Syndromic Surveillance Program
- R Implementation
Human Mortality
NIH Data
MHealth Dataset
Agency for Healthcare Research and Quality
Healthcare Cost and Utilization Project
- R implementation
Open Payments: payments made by firms to doctors.
IQVIA: Drug data.

Politics

OpenSecrets: Elections and Fundraising Data

Laws

State Tracking (free)
Bloomberg Law: through most university law schools
Google Case Law: free, but no api (might be able to bypass with SerpAPI)
Case Law Access Project: free with api
Nexis Uni: through university (no API)
Fair Trade:
- Github
- Copyright

Higher Education

HERI Data: The Higher Education Research Institute by UCLA
- CIRP Freshman Survey Trends: 1966 to 2008
- CIRP College Senior Survey Trends: 1994 to 2008
- HERI Faculty Survey Trends: 1989 to 1998
National Survey of College Graduates
NCAA Shareable Data: free
NCAA Injury Surveillance Program: have to contact
Equity in Athletics Data Analysis: free
Donorschoose: (free for researchers) public school teachers can ask for classroom donations.
Education Data Partnership: access to K-12 education data in California.
College Enrollment Data by OpenSDP

Social Science

Institute For Social Research (ICPSR): data archive at University of Michigan (free).
Data.gov: since 2009 and by the U.S. General Services Administration
SODA: API data from governments, non-profits, and NGOs.
Google Public Data
Open ICPSR
SRDA: Chinese Survey Research Data Archive
Consortium of European Social Science Data Archives (CESSDA)
Pew Research Center
Kaggle
Data Planet: by SAGE Publishing Resource.
Mendeley: mostly based on published paper (i.e., host dataset for published papers)
Humdata: Humanitarian Data Exchange (including Facebook’s Data for Good data)
Census Data:
- IPUMS
- IPUMS Data in R
Awesome Data on Everything
Government Data
Political Dataset
UK Data Service
GSMA Intelligence: Paid data on cellular connection, mobile subscribers, IoT
Google Cloud Dataset
Paperswithcode
Datahub: free
LGBTQ+ rights around the world (equity index): api

Sports

NFL

Soccer

Baseball

Basketball

College

Hockey

Others

Network

Machine Learning

Video data

Public APIs

Potential Instruments

Control Variables

National holiday: (Datta, 2022)
Weather Underground by IP Address
Changes to websites using Wayback Machine: can also submit links to track changes
Interesting Dataset every week: tidytuesday

References

Arora, A., Belenzon, S., & Sheer, L. (2021). Matching patents to Compustat firms, 1980–2015: Dynamic reassignment, name changes, and ownership structures. Research Policy, 50(5), 104217.

Choo, J., Lee, C., Lee, D., Zha, H., & Park, H. (2014, February). Understanding and promoting micro-finance activities in Kiva.org. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (pp. 583-592).

Colicev, A., Malshe, A., Pauwels, K., & O’Connor, P. (2018). Improving consumer mindset metrics and shareholder value through social media: The different roles of owned and earned media. Journal of Marketing, 82(1), 37-56.

Datta, H., van Heerde, H. J., Dekimpe, M. G., & Steenkamp, J.-B. E. M. (2022). Cross-national differences in market response: Line-length, price, and distribution elasticities in fourteen Indo-Pacific Rim economies. Journal of Marketing Research, 59(2), 251-270.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.

Kogan, L., Papanikolaou, D., Seru, A., & Stoffman, N. (2017). Technological innovation, resource allocation, and growth. The Quarterly Journal of Economics, 132(2), 665-712.
