Data

Rules of thumb when naming files:

  • Human readable: the name carries information about the content
    • Numeric values always go first
    • Use the ISO 8601 standard (yyyy-mm-dd) for dates (though dates in file names are best avoided in the first place)
    • Never drop leading zeros (e.g., 01, not 1).
  • Compatible with every system's default ordering
  • Avoid spaces and uppercase letters.
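
The naming rules above can be sketched as a small helper. This is only an illustration: the function name `make_filename`, the underscore separator, and the two-digit padding width are assumptions, not part of the original guidance.

```python
import re
from datetime import date
from typing import Optional


def make_filename(index: int, description: str, ext: str,
                  when: Optional[date] = None, width: int = 2) -> str:
    """Build a file name such as '01_clean-survey-data.csv' (hypothetical helper)."""
    # The numeric value goes first, zero-padded so that 02 sorts before 10.
    parts = [f"{index:0{width}d}"]
    if when is not None:
        # ISO 8601 (yyyy-mm-dd) sorts chronologically as plain text.
        parts.append(when.isoformat())
    # Lowercase, and collapse anything that is not a letter or digit into '-'.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    parts.append(slug)
    return "_".join(parts) + "." + ext.lower().lstrip(".")


# Default (lexicographic) ordering now matches the intended numeric order.
names = [make_filename(i, "Clean Survey Data", "csv") for i in (10, 2, 1)]
print(sorted(names))
```

Without the zero padding, a plain text sort would place 10 before 2, which is why leading zeros should be kept.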

For data management, document the dataset in datasheet.md or the README.md file, following Datasheets for Datasets (Gebru et al., 2021). The questions below have been modified and simplified; for the original questions with examples, please see the paper:

  • Motivation

    • Why was the dataset created?

    • Who created the dataset?

    • Who funded the project?

  • Composition

    • What do the observations represent?

    • Is the dataset the population or a small sample (random or nonrandom)?

    • Describe missing data (e.g., type and missingness mechanism)

    • Describe data splits (e.g., training, testing, etc.)

    • Does the dataset contain sensitive or confidential information?

      For datasets related to people:

    • Can data users identify sub-population or individuals from the dataset?

  • Collection process

    • How was the data collected (directly observed or indirectly inferred)?

    • What procedures were used to collect data?

    • What was the sampling strategy (e.g., probabilistic)?

    • Data time frame

    • Were any ethical review processes conducted? (e.g., institutional review board?)

      For datasets related to people:

    • Did you collect data from respondents or via third parties?

    • Did you obtain respondents’ consent?

    • Is there any mechanism for respondents to revoke their consent?

    • Did you conduct a data protection impact analysis?

  • Preprocessing/cleaning/labeling

    • Describe these processes

    • Is the raw dataset still available upon request?

    • What software was used to process the data? Provide a script if possible.

  • Uses

    • Provide exemplary uses of the dataset (a list of published papers using the dataset is encouraged)

    • Is there anything about the dataset (e.g., composition, collection, processing) that can impact future uses?

    • In which cases should the dataset not be used?

  • Distribution

    • How will the dataset be distributed? (e.g., API, GitHub, data repo)

    • When will the dataset be available?

    • Does the dataset involve copyright, intellectual property (IP) license, or terms of use?

  • Maintenance

    • Who supports/hosts/maintains the dataset?

    • How can the data maintainer be contacted?

    • Are there any errors that data users need to know about?

    • Will the dataset be updated?

    • Will older versions of the dataset continue to be supported?

    • How can others extend/build on/contribute to the dataset?
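
As a starting point, the checklist above can be turned into a skeleton for datasheet.md. The sketch below is one possible layout, not a format prescribed by Gebru et al. (2021): the function name `render_datasheet`, the Markdown headings, and the `TODO` placeholders are assumptions, and only a few abbreviated prompts per section are included.

```python
# Section names follow the checklist above; prompts are abbreviated
# and the selection is illustrative, not the full list of questions.
SECTIONS = {
    "Motivation": [
        "Why was the dataset created?",
        "Who created the dataset?",
        "Who funded the project?",
    ],
    "Composition": [
        "What do the observations represent?",
        "Is the dataset the population or a sample?",
        "Describe missing data and data splits.",
    ],
    "Collection process": [
        "How was the data collected?",
        "What was the sampling strategy?",
        "What is the data time frame?",
    ],
    "Preprocessing/cleaning/labeling": [
        "Describe these processes.",
        "Is the raw dataset still available?",
    ],
    "Uses": [
        "Provide exemplary uses of the dataset.",
        "In which cases should the dataset not be used?",
    ],
    "Distribution": [
        "How and when will the dataset be distributed?",
        "Are there copyright, license, or terms-of-use constraints?",
    ],
    "Maintenance": [
        "Who maintains the dataset, and how can they be contacted?",
        "Will the dataset be updated?",
    ],
}


def render_datasheet(title: str, sections: dict = SECTIONS) -> str:
    """Return a Markdown skeleton with one heading per checklist section."""
    lines = [f"# Datasheet: {title}", ""]
    for name, questions in sections.items():
        lines += [f"## {name}", ""]
        for question in questions:
            # Each prompt gets a placeholder to be replaced by the author.
            lines += [f"- **{question}**", "  TODO", ""]
    return "\n".join(lines)


# "example-dataset" is a placeholder name.
text = render_datasheet("example-dataset")
print(text.splitlines()[0])
```

The returned string can be saved as datasheet.md and filled in by hand; keeping the prompts in a dictionary makes it easy to add the people-related questions only for datasets where they apply.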

Management

Innovation

Patent Data

Technology Adoption

Academic Literature

Demographic Data

Politics

Geography

Finance

Taxes

Housing

Factors

Risk

Cryptocurrency

  • CryptoScamDB: reports about scams (only the names of blacklisted domains)

  • BitcoinAbuse: reports about scams with date, abuser, description, from country, and created_at.

Textual Network Data in Finance

Bankruptcy

Heir

Economics

Pay Records/Salary

Marketing

Data Breach

App

Social Media

Product Introduction

Sentiment

  • DataStreamer: Search API returns search results from multiple sources (Twitter, Instagram, Blogs, Forums, News, International News).
  • Bloomberg: sentiments based on news articles and Twitter
  • Meltwater: paid
  • Infegy: paid

Firm

Branding

  • EquiTrend (The Harris Poll):
    • Three factors: Familiarity, Quality, and Purchase Consideration
    • 45,000 US consumers assessed nearly 2,000 brands across 196 categories.
    • 91 companies were awarded the coveted Brand of the Year designation across 87 categories
  • World Brand Lab: brand equity ranking in China.
  • Brand24 (media monitoring tool): monitors brands on Twitter, Facebook, and Instagram.
  • Affectiva: data on emotional reactions to 53k ads across 90 countries and 8 years.
  • YouGov: daily data on brands (see Colicev et al., 2018, when using this dataset).

Advertising

Others

Review

Search Engine Data

Web Traffic

Sales

Complaints

Health

Politics

Laws

Higher Education

Social Science

Sports

NFL

Soccer

Baseball

Basketball

College

Hockey

Others

Network

Machine Learning

Video data

Public APIs

Potential Instruments

Control Variables

References

Arora, A., Belenzon, S., & Sheer, L. (2021). Matching patents to Compustat firms, 1980–2015: Dynamic reassignment, name changes, and ownership structures. Research Policy, 50(5), 104217.

Choo, J., Lee, C., Lee, D., Zha, H., & Park, H. (2014, February). Understanding and promoting micro-finance activities in Kiva.org. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (pp. 583-592).

Colicev, A., Malshe, A., Pauwels, K., & O’Connor, P. (2018). Improving consumer mindset metrics and shareholder value through social media: The different roles of owned and earned media. Journal of Marketing, 82(1), 37-56.

Datta, H., van Heerde, H. J., Dekimpe, M. G., & Steenkamp, J.-B. E. M. (2022). Cross-national differences in market response: Line-length, price, and distribution elasticities in fourteen Indo-Pacific Rim economies. Journal of Marketing Research, 59(2), 251-270.

Kogan, L., Papanikolaou, D., Seru, A., & Stoffman, N. (2017). Technological innovation, resource allocation, and growth. The Quarterly Journal of Economics, 132(2), 665-712.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.