Rule of thumb when naming files:
For data management (to be documented in datasheet.md or the README.md file), see Datasheets for Datasets (Gebru et al., 2021). The questions below have been modified and simplified; for the original questions with examples, please see the paper:
Why was the dataset created?
Who created the dataset?
Who funded the project?
What do the observations represent?
Is the dataset the population or a small sample (random or nonrandom)?
Describe missing data (e.g., type and missing mechanism)
Describe data splits (e.g., training, testing, etc.)
Does the dataset contain sensitive or confidential information?
For datasets related to people:
Can data users identify subpopulations or individuals from the dataset?
How was the data collected (directly observable or indirectly inferred)?
What procedures were used to collect data?
What was the sampling strategy (e.g., probabilistic)?
What is the time frame of the data?
Were any ethical review processes conducted (e.g., by an institutional review board)?
For datasets related to people:
Did you collect data from respondents or via third parties?
Did you obtain respondents' consent?
Is there any mechanism for respondents to revoke their consent?
Did you conduct a data protection impact analysis?
Describe these processes
Is the raw dataset still available upon request?
What software was used to process the data? Provide a script if possible.
Provide exemplary uses of the dataset (a list of published papers using the dataset is encouraged)
Is there anything about the dataset (e.g., composition, collection, processing) that can impact future uses?
In which cases should the dataset not be used?
How will the dataset be distributed? (e.g., API, GitHub, data repo)
When will the dataset be available?
Who supports, hosts, or maintains the dataset?
How can the data maintainer be contacted?
Are there any errors that data users need to know about?
Will the dataset be updated?
Will older versions of the dataset continue to be supported?
How can others extend/build on/contribute to the dataset?
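One way to put these questions to work is to generate an empty datasheet.md that a project can fill in. The following is a minimal sketch, not a prescribed tool: the section names follow the headings used in Gebru et al. (2021), the questions are a subset of the list above, and the function names `render_datasheet` and `write_datasheet` and the `_TODO: answer_` placeholder are hypothetical choices made for illustration.

```python
from pathlib import Path

# Section names follow Gebru et al. (2021); the questions are a subset of the
# simplified list above. Extend the dictionary to cover the remaining questions.
DATASHEET_QUESTIONS = {
    "Motivation": [
        "Why was the dataset created?",
        "Who created the dataset?",
        "Who funded the project?",
    ],
    "Composition": [
        "What do the observations represent?",
        "Is the dataset the population or a small sample (random or nonrandom)?",
        "Does the dataset contain sensitive or confidential information?",
    ],
    "Collection process": [
        "What procedures were used to collect data?",
        "What was the sampling strategy (e.g., probabilistic)?",
    ],
    "Uses": [
        "In which cases should the dataset not be used?",
    ],
    "Distribution": [
        "How will the dataset be distributed? (e.g., API, GitHub, data repo)",
    ],
    "Maintenance": [
        "Will the dataset be updated?",
    ],
}


def render_datasheet(title, questions=DATASHEET_QUESTIONS):
    """Return a Markdown skeleton with one placeholder answer per question."""
    lines = [f"# Datasheet: {title}", ""]
    for section, items in questions.items():
        lines.append(f"## {section}")
        lines.append("")
        for question in items:
            lines.append(f"**{question}**")
            lines.append("")
            lines.append("_TODO: answer_")  # hypothetical placeholder text
            lines.append("")
    return "\n".join(lines)


def write_datasheet(title, path="datasheet.md"):
    """Write the skeleton to disk, refusing to overwrite an existing file."""
    target = Path(path)
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    target.write_text(render_datasheet(title), encoding="utf-8")
    return target
```

Calling `write_datasheet("My dataset")` from the project root would create datasheet.md next to the README.md; the overwrite check is there so that a filled-in datasheet is never replaced by an empty one.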
A database of CEO turnover and dismissal in S&P 1500 firms, 2000–2018 (SMJ, 2021)
Crunchbase: descriptions of start-ups, companies, and people
Dun&Bradstreet: company info
Privco: financial and market intelligence on private firms
Corporate Registration: search info on corporation by state
Thomasnet: for suppliers and buyers
CryptoScamDB: reports about scams (only the names of blacklisted domains)
BitcoinAbuse: reports about scams with date, abuser, description, from country, and created_at.
AI Incident Database: https://incidentdatabase.ai/
apptweak or apptweak.io (paid): app download, revenue, rating
Apple Search Ads service: Apple Search Popularity Score
Instaloader: download content from Instagram
instagrapi: download and push data on Instagram
Fame: data on companies in the UK and Ireland.
Wharton Customer Analytics: write a proposal to partner with firms to get data
TripAdvisorReview: using Octoparse
Julian McAuley’s lab data: Recommender System and personalization dataset
Goodreads dataset: book reviews
MovieLens: Movie recommendation
Ratebeer: beer reviews. Sign up for the API, or use the SNAP data
Alexa Web Information Service: retired as of Dec 8, 2021
ahrefs: Min $99
semrush: Min $119
kissmetrics: negotiable price
Cisco Umbrella: free (relative ranking data, not actual browsing activity)
rankwatch: $29 (keyword search)
SE Ranking: free trial - $32/month (url monitor)
Amazon QuickSight: fancy keyword research for enterprises
tranco: free for researchers (aggregate ranking lists)
Cloudflare Radar: Internet usage
comscore: paid enterprise for social and web traffic
moz: $99 web traffic
similarweb: enterprise solution $199 (cannot use the API or request a customized dataset unless you are using the ultimate package, >$20k)
infomer: check website safety
majestic: website traffic $49, but for API $399
SE Ranking: SEO tools and dashboard API is $192
serpstat: domain, keyword, URL, backlink analysis (recommended, API is available for every plan, as low as $55)
Complaints Board: can search for businesses
Ripoff Report: Complaints, Reviews, Scams, Lawsuits, Frauds
Open Payments: payments made by firms to doctors.
Bloomberg Law: through most university law schools
Google Case Law: free, but no API (might be able to bypass with SerpAPI)
Caselaw Access Project: free, with an API
Nexis Uni: through university (no API)
HERI Data: The Higher Education Research Institute by UCLA
CIRP Freshman Survey Trends: 1966 to 2008
CIRP College Senior Survey Trends: 1994 to 2008
HERI Faculty Survey Trends: 1989 to 1998
NCAA Shareable Data: free
NCAA Injury Surveillance Program: have to contact
Donorschoose: (free for researchers) public school teachers can ask for classroom donations.
Education Data Partnership: access to K-12 education data in California.
National holiday: (Datta, 2022)
Changes to websites using Wayback Machine: can also submit links to track changes
Interesting Dataset every week: tidytuesday
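For the Wayback Machine entry above, checking whether an archived copy of a page exists near a given date can be scripted. The sketch below assumes the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`) and the JSON layout it is documented to return (`archived_snapshots` → `closest`); both should be verified against the current Internet Archive documentation before relying on them. The helper names `build_query`, `parse_closest`, and `closest_snapshot` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the request URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def parse_closest(payload):
    """Extract the closest snapshot from a decoded response, or None if absent.

    Assumes the layout {"archived_snapshots": {"closest": {...}}}.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return {"url": closest["url"], "timestamp": closest["timestamp"]}


def closest_snapshot(url, timestamp=None, timeout=30):
    """Query the API over the network and return the closest snapshot, if any."""
    with urlopen(build_query(url, timestamp), timeout=timeout) as response:
        return parse_closest(json.load(response))
```

Running `closest_snapshot` for the same page at several timestamps gives a list of archived copies that can then be downloaded and compared; the network call is kept separate from the parsing so that the parsing can be checked offline.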
Arora, A., Belenzon, S., & Sheer, L. (2021). Matching patents to Compustat firms, 1980–2015: Dynamic reassignment, name changes, and ownership structures. Research Policy, 50(5), 104217.
Choo, J., Lee, C., Lee, D., Zha, H., & Park, H. (2014, February). Understanding and promoting micro-finance activities in Kiva.org. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (pp. 583-592).
Colicev, A., Malshe, A., Pauwels, K., & O’Connor, P. (2018). Improving consumer mindset metrics and shareholder value through social media: The different roles of owned and earned media. Journal of Marketing, 82(1), 37-56.
Datta, H., van Heerde, H. J., Dekimpe, M. G., & Steenkamp, J.-B. E. M. (2022). Cross-national differences in market response: Line-length, price, and distribution elasticities in fourteen Indo-Pacific Rim economies. Journal of Marketing Research, 59(2), 251-270.
Kogan, L., Papanikolaou, D., Seru, A., & Stoffman, N. (2017). Technological innovation, resource allocation, and growth. The Quarterly Journal of Economics, 132(2), 665-712.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.