Rules of thumb when naming files:

  • Human readable: the name should carry information about the content
    • Numeric values always go first
    • Use the ISO 8601 standard (yyyy-mm-dd) for dates (although using dates in file names is not recommended in the first place)
    • Never drop leading zeros (e.g., 01, not 1)
  • Compatible with every system's default ordering
  • Avoid spaces and uppercase letters
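
The naming rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name make_filename, the underscore and hyphen separators, and the default padding width of 2 are hypothetical choices, not conventions prescribed by the text.

```python
import re
from datetime import date
from typing import Optional


def make_filename(index: int, description: str, ext: str,
                  width: int = 2, day: Optional[date] = None) -> str:
    """Build a file name: zero-padded number first, then a lowercase slug."""
    # Lowercase the description and replace every run of
    # non-alphanumeric characters (spaces included) with a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    parts = [f"{index:0{width}d}"]      # numeric value first, leading zeros kept
    if day is not None:
        parts.append(day.isoformat())   # ISO 8601 date: yyyy-mm-dd
    parts.append(slug)
    return "_".join(parts) + "." + ext.lower().lstrip(".")


# Hypothetical file descriptions, for illustration only.
names = [make_filename(i, d, "csv")
         for i, d in [(10, "Model Fit"), (2, "Clean Data"), (1, "Raw Survey Data")]]
print(sorted(names))
# ['01_raw-survey-data.csv', '02_clean-data.csv', '10_model-fit.csv']
```

Because the numbers are zero-padded, a plain alphabetical sort (the default ordering on most systems) matches the numeric order; without the padding, 10 would sort before 2.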

For data management (this documentation should be stored with the data file), see Datasheets for Datasets (Gebru et al., 2021). The questions below have been modified and simplified; for the original questions with examples, see the paper:

  • Motivation

    • Why was the dataset created?

    • Who created the dataset?

    • Who funded the project?

  • Composition

    • What do the observations represent?

    • Is the dataset the population or a small sample (random or nonrandom)?

    • Describe missing data (e.g., type and missing mechanism)

    • Describe data splits (e.g., training, testing, etc.)

    • Does the dataset contain sensitive or confidential information?

      For datasets related to people:

    • Can data users identify subpopulations or individuals from the dataset?

  • Collection process

    • How was the data collected (directly observed or indirectly inferred)?

    • What procedures were used to collect data?

    • What was the sampling strategy (e.g., probabilistic)?

    • What time frame does the data cover?

    • Were any ethical review processes conducted (e.g., by an institutional review board)?

      For datasets related to people:

    • Did you collect data from respondents or via third parties?

    • Did you obtain the respondents' consent?

    • Is there any mechanism for respondents to revoke their consent?

    • Did you conduct a data protection impact analysis?

  • Preprocessing/cleaning/labeling

    • Describe these processes

    • Is the raw dataset still available upon request?

    • What software was used to process the data? Provide a script if possible.

  • Uses

    • Provide example uses of the dataset (a list of published papers using the dataset is encouraged)

    • Is there anything about the dataset (e.g., composition, collection, processing) that can impact future uses?

    • In which cases should the dataset not be used?

  • Distribution

    • How will the dataset be distributed? (e.g., API, GitHub, data repo)

    • When will the dataset be available?

    • Is the dataset subject to copyright, an intellectual property (IP) license, or terms of use?

  • Maintenance

    • Who supports/hosts/maintains the dataset?

    • How can the data maintainer be contacted?

    • Are there any errors that data users need to know about?

    • Will the dataset be updated?

    • Will older versions of the dataset continue to be supported?

    • How can others extend/build on/contribute to the dataset?
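
One way to keep these answers next to the data is to generate a plain-text datasheet from a fixed list of questions. The sketch below is a minimal illustration, not a format defined by Gebru et al. (2021): the names DATASHEET_SECTIONS, render_datasheet, and unanswered are hypothetical, the question list is abbreviated, and the sample answer is made up.

```python
from typing import Dict, List

# Section names follow the checklist above; the question wording is abbreviated.
DATASHEET_SECTIONS: Dict[str, List[str]] = {
    "Motivation": ["Why was the dataset created?", "Who created it?", "Who funded it?"],
    "Composition": ["What do the observations represent?", "Population or sample?"],
    "Collection process": ["How was the data collected?", "What time frame does it cover?"],
    "Preprocessing/cleaning/labeling": ["Describe these processes."],
    "Uses": ["When should the dataset not be used?"],
    "Distribution": ["How will the dataset be distributed?"],
    "Maintenance": ["Who maintains the dataset?", "Will it be updated?"],
}


def render_datasheet(answers: Dict[str, Dict[str, str]]) -> str:
    """Return a plain-text datasheet; unanswered questions are marked TODO."""
    lines: List[str] = []
    for section, questions in DATASHEET_SECTIONS.items():
        lines.append(section)
        for question in questions:
            answer = answers.get(section, {}).get(question, "TODO")
            lines.append(f"  Q: {question}")
            lines.append(f"  A: {answer}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


def unanswered(answers: Dict[str, Dict[str, str]]) -> List[str]:
    """List the questions that still have no answer."""
    return [q for s, qs in DATASHEET_SECTIONS.items()
            for q in qs if q not in answers.get(s, {})]


# Made-up answer, for illustration only.
answers = {"Motivation": {"Who funded it?": "Hypothetical grant"}}
print(render_datasheet(answers))
print(len(unanswered(answers)), "questions still open")
```

Saving the rendered text in the same folder as the dataset keeps the documentation and the data together, and the list of open questions shows what is still missing before the dataset is shared.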



Patent Data

Technology Adoption

Academic Literature






Cryptocurrency

  • CryptoScamDB: reports about scams (only the names of blacklisted domains)

  • BitcoinAbuse: reports about scams, with date, abuser, description, from country, and created_at.

Textual Network Data in Finance




Pay Records/Salary


Data Breach


Social Media

Product Introduction


  • DataStreamer: the Search API returns search results from multiple sources (Twitter, Instagram, Blogs, Forums, News, International News).
  • Bloomberg: sentiments based on news articles and Twitter
  • Meltwater: paid
  • Infegy: paid



  • EquiTrend (The Harris Poll):
    • Three factors: Familiarity, Quality, and Purchase Consideration
    • 45,000 US consumers assessed nearly 2,000 brands across 196 categories.
    • 91 companies were awarded the coveted Brand of the Year designation across 87 categories
  • World Brand Lab: brand equity ranking in China.
  • Brand24 - Media Monitoring Tool: monitors brands on Twitter, Facebook, and Instagram.
  • Affectiva - Humanizing Technology: data on emotional reactions to 53k ads across 90 countries and 8 years.
  • YouGov: daily data on brands (see Colicev et al., 2018, when using this dataset).




Search Engine Data

Web Traffic






Higher Education

Social Science










Machine Learning

Video data

Public APIs

Potential Instruments

Control Variables


Arora, A., Belenzon, S., & Sheer, L. (2021). Matching patents to compustat firms, 1980–2015: Dynamic reassignment, name changes, and ownership structures. Research Policy, 50(5), 104217.

Choo, J., Lee, C., Lee, D., Zha, H., & Park, H. (2014, February). Understanding and promoting micro-finance activities in kiva.org. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (pp. 583-592).

Colicev, A., Malshe, A., Pauwels, K., & O’Connor, P. (2018). Improving consumer mindset metrics and shareholder value through social media: The different roles of owned and earned media. Journal of Marketing, 82(1), 37-56.

Datta, H., van Heerde, H. J., Dekimpe, M. G., & Steenkamp, J.-B. E. M. (2022). Cross-national differences in market response: Line-length, price, and distribution elasticities in fourteen Indo-Pacific Rim economies. Journal of Marketing Research, 59(2), 251-270.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.

Kogan, L., Papanikolaou, D., Seru, A., & Stoffman, N. (2017). Technological innovation, resource allocation, and growth. The Quarterly Journal of Economics, 132(2), 665-712.