Rule of thumb when naming files:
For data management (to be documented in a datasheet.md or README.md file), see Datasheets for Datasets (Gebru et al., 2021). The questions below have been modified and simplified; for the original questions with examples, see the paper:
Motivation
Why was the dataset created?
Who created the dataset?
Who funded the project?
Composition
What do the observations represent?
Is the dataset the population or a small sample (random or nonrandom)?
Describe missing data (e.g., type and missing mechanism)
Describe data splits (e.g., training, testing, etc.)
Does the dataset contain sensitive or confidential information?
For datasets related to people
Can data users identify sub-populations or individuals from the dataset?
Collection process
How was the data collected (directly observed or indirectly inferred)?
What procedures were used to collect data?
What was the sampling strategy (e.g., probabilistic)?
Data time frame
Were any ethical review processes conducted? (e.g., institutional review board?)
For datasets related to people
Did you collect data from respondents or via third parties?
Did you obtain respondents' consent?
Is there any mechanism for respondents to revoke their consent?
Did you conduct a data protection impact analysis?
Preprocessing/cleaning/labeling
Describe these processes
Is the raw dataset still available upon request?
What software was used to process the data? Provide a script if possible.
Uses
Provide example uses of the dataset (a list of published papers using the dataset is encouraged)
Is there anything about the dataset (e.g., composition, collection, processing) that can impact future uses?
In which cases should the dataset not be used?
Distribution
How will the dataset be distributed? (e.g., API, GitHub, data repo)
When will the dataset be available?
Does the dataset involve copyright, intellectual property (IP) license, or terms of use?
Maintenance
Who supports/hosts/maintains the dataset?
How can the data maintainer be contacted?
Are there any errors that data users need to know about?
Will the dataset be updated?
Will older versions of the dataset continue to be supported?
How can others extend/build on/contribute to the dataset?
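The checklist above can be turned into a starting file. The following is a minimal sketch, assuming a Markdown layout for datasheet.md; the function names (`render_datasheet`, `write_datasheet`), the heading levels, and the abbreviated prompts are illustrative choices, not something prescribed by Gebru et al. (2021).

```python
from pathlib import Path

# Section names follow the checklist above; prompts are abbreviated
# (hypothetical wording, adjust to your project).
SECTIONS = {
    "Motivation": ["Why was the dataset created?", "Who created it?", "Who funded it?"],
    "Composition": ["What do the observations represent?", "Population or sample?",
                    "Missing data?", "Data splits?", "Sensitive information?"],
    "Collection process": ["How was the data collected?", "Sampling strategy?",
                           "Time frame?", "Ethical review?"],
    "Preprocessing/cleaning/labeling": ["Describe the processes.",
                                        "Is the raw data available?", "Software and scripts?"],
    "Uses": ["Example uses?", "Factors affecting future uses?", "When not to use?"],
    "Distribution": ["How and when is it distributed?", "Copyright, license, terms of use?"],
    "Maintenance": ["Who maintains it?", "Contact?", "Known errors?",
                    "Updates and versioning?", "How to contribute?"],
}


def render_datasheet(title, sections=SECTIONS):
    """Return the text of a datasheet skeleton with a TODO under each prompt."""
    lines = [f"# Datasheet: {title}", ""]
    for heading, questions in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for question in questions:
            lines.append(f"- {question} TODO")
        lines.append("")
    return "\n".join(lines)


def write_datasheet(title, path="datasheet.md"):
    """Write the skeleton to disk and return the path that was written."""
    path = Path(path)
    path.write_text(render_datasheet(title), encoding="utf-8")
    return path
```

Calling `write_datasheet("My dataset")` would create datasheet.md in the working directory; each TODO is then replaced with the answer for that dataset.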
A database of CEO turnover and dismissal in S&P 1500 firms, 2000–2018 (SMJ, 2021)
State Ideology data: only to 2017
Crunchbase: descriptions of start-ups, companies, and people
Dun&Bradstreet: company info
Privco: financial and market intelligence on private firms (USC has access)
Corporate Registration: search for information on corporations by state
Thomasnet: for suppliers and buyers
Inside Airbnb: Free
airdna: paid
CryptoScamDB: Report about scams (only name of blacklisted domains)
BitcoinAbuse: reports about scams with date, abuser, description, from country, and created_at.
AI Incident Database: https://incidentdatabase.ai/
apptweak or apptweak.io (paid): app download, revenue, rating
apptopia (paid)
sensortower (paid)
appfollow (paid)
Apple Search Ads service: Apple Search Popularity Score
Instaloader: download content from Instagram
instagrapi: download and push data on Instagram
Fame: data on companies in the UK and Ireland.
Wharton Customer Analytics: write a proposal to partner with firms to get data
TripAdvisor Review: using Octoparse
Julian McAuley’s lab data: Recommender System and personalization dataset
Goodreads dataset: book reviews
MovieLens: Movie recommendation
Ratebeer: Beer review. Sign up for API. Or use SNAP data
SerpAPI: paid
Alexa Web Information Service: retired as of Dec 8, 2021
ahrefs: Min $99
semrush: Min $119
kissmetrics: negotiable price
authoritas: $99
watchthem: $29
Cisco Umbrella: free (relative ranking data, not actual browsing activity)
rankwatch: $29 (keyword search)
SE Ranking: free trial - $32/month (url monitor)
Amazon QuickSight: fancy keyword research for enterprises
tranco: free for researchers (aggregate ranking lists)
Cloudflare Radar: Internet usage
comscore: paid enterprise for social and web traffic
moz: $99 web traffic
similarweb: enterprise solution $199 (cannot use the API or request a customized dataset unless you are using the ultimate package, >$20k)
infomer: check website safety
majestic: website traffic $49, but for API $399
SE Ranking: SEO tools and dashboard API is $192
serpstat: domain, keyword, URL, backlink analysis (recommended, API is available for every plan, as low as $55)
Complaints Board: can search for businesses
Ripoff Report: Complaints, Reviews, Scams, Lawsuits, Frauds
Open Payments: payments made by firms to doctors.
Bloomberg Law: through most university law schools
Google Case Law: free, but no API (might be able to bypass with SerpAPI)
Case Law Access Project: free with API
Nexis Uni: through university (no API)
Fair Trade:
HERI Data: The Higher Education Research Institute by UCLA
CIRP Freshman Survey Trends: 1966 to 2008
CIRP College Senior Survey Trends: 1994 to 2008
HERI Faculty Survey Trends: 1989 to 1998
NCAA Shareable Data: free
NCAA Injury Surveillance Program: have to contact
Donorschoose: (free for researchers) public school teachers can ask for classroom donations.
Education Data Partnership: access to K-12 education data in California.
NBA Basketball 2000-2020
Formula 1 Race 1950 - 2017
National holiday: (Datta, 2022)
Changes to websites using Wayback Machine: can also submit links to track changes
Interesting Dataset every week: tidytuesday
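For the Wayback Machine entry above, tracking changes to a page usually starts with finding the archived snapshot closest to a date. The sketch below assumes the Internet Archive's public availability endpoint and its JSON response shape (`archived_snapshots` → `closest`); the helper names are hypothetical, and `fetch_closest` needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return (timestamp, snapshot_url) from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def fetch_closest(page_url, timestamp=None):
    """Query the API over the network and return the closest snapshot, if any."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Requesting the same page at several timestamps and comparing the returned snapshots is one simple way to see when a site changed.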
Arora, A., Belenzon, S., & Sheer, L. (2021). Matching patents to Compustat firms, 1980–2015: Dynamic reassignment, name changes, and ownership structures. Research Policy, 50(5), 104217.
Choo, J., Lee, C., Lee, D., Zha, H., & Park, H. (2014, February). Understanding and promoting micro-finance activities in Kiva.org. In Proceedings of the 7th ACM international conference on Web search and data mining (pp. 583-592).
Colicev, A., Malshe, A., Pauwels, K., & O’Connor, P. (2018). Improving consumer mindset metrics and shareholder value through social media: The different roles of owned and earned media. Journal of Marketing, 82(1), 37-56.
Datta, H., van Heerde, H. J., Dekimpe, M. G., & Steenkamp, J.-B. E. M. (2022). Cross-national differences in market response: Line-length, price, and distribution elasticities in fourteen Indo-Pacific Rim economies. Journal of Marketing Research, 59(2), 251-270.
Kogan, L., Papanikolaou, D., Seru, A., & Stoffman, N. (2017). Technological innovation, resource allocation, and growth. The Quarterly Journal of Economics, 132(2), 665-712.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.