Data are any collection of information, such as numbers, text, images, or audio. They can be classified across three dimensions: numerical vs. categorical; cross-sectional vs. time-series vs. panel; and structured vs. unstructured.
Numerical vs Categorical
The key distinction between numerical and categorical data types is that meaningful arithmetic calculations can be made on numerical data but not on categorical data.
Numerical or quantitative data represent measured or counted quantities as numbers. Such data can be discrete or continuous.
Discrete data result from a counting process and have a finite number of values. For example, m, the number of times per year that interest is compounded, is discrete. With non-continuous compounding, m could be as small as 1 for annual compounding or as large as 365 for daily compounding.
Continuous data can take on any numerical value within a specified range. For example, the value of a stock option can vary within a range of values based on changes in the value of the underlying stock.
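The contrast between a discrete input and a continuous result can be sketched with the compounding example. This is a minimal illustration, not part of the original notes: the stated rate of 6% and the names stated_rate, compounding_frequencies, and effective_annual_rate are assumptions chosen for the example. It uses the standard effective annual rate formula, (1 + r/m)^m - 1.

```python
# Hypothetical stated annual rate, chosen only for illustration.
stated_rate = 0.06

# m is discrete: it counts compounding periods per year, so it can only
# take whole-number values such as these.
compounding_frequencies = {
    "annual": 1,
    "semiannual": 2,
    "quarterly": 4,
    "monthly": 12,
    "daily": 365,
}


def effective_annual_rate(rate: float, m: int) -> float:
    """Effective annual rate for a stated rate compounded m times per year."""
    return (1 + rate / m) ** m - 1


# The resulting rate is continuous: it can fall anywhere within a range.
ear_by_frequency = {
    name: effective_annual_rate(stated_rate, m)
    for name, m in compounding_frequencies.items()
}

for name, ear in ear_by_frequency.items():
    m = compounding_frequencies[name]
    print(f"{name:>10}: m = {m:>3}, effective annual rate = {ear:.6f}")
```

Here m moves in whole-number steps, while the computed rate is not restricted to any fixed set of values.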
Categorical or qualitative data describe a characteristic of the information and can be used to categorize or visualize the information. Such data can be nominal or ordinal.
Nominal data are not amenable to ordering or ranking. They can be presented as text or numbers. For example, the aisle in which a product appears in a supermarket may signify that it is a certain type of product, but the products in aisle 3 are not inherently better or worse than those in aisle 7.
Ordinal data can be logically ordered or ranked. An example is a list of the 100 largest stocks by market capitalization. In this case, we know that the stock ranked 1 is larger than the stock ranked 2, and so on. However, the ranks do not tell us how much larger.
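The difference in which operations are meaningful for each type can be sketched as follows. This is an illustrative example only: the products, tickers, and values are invented, and the variable names are assumptions rather than anything from the original notes.

```python
from collections import Counter
from statistics import mean

# Hypothetical records, made up for illustration.
aisle_by_product = {"bread": 3, "milk": 7, "soap": 3}            # nominal (coded as numbers)
market_cap_rank = {"AAA": 1, "BBB": 2, "CCC": 3}                 # ordinal
market_cap_usd_bn = {"AAA": 900.0, "BBB": 850.0, "CCC": 200.0}   # numerical

# Nominal: counting members of each category is meaningful, but
# averaging aisle numbers would not be.
products_per_aisle = Counter(aisle_by_product.values())

# Ordinal: sorting by rank is meaningful.
ordered_by_rank = sorted(market_cap_rank, key=market_cap_rank.get)

# The ranks are evenly spaced, but the underlying sizes are not, so the
# ranks alone cannot say how much larger one stock is than another.
cap_gap_1_2 = market_cap_usd_bn["AAA"] - market_cap_usd_bn["BBB"]
cap_gap_2_3 = market_cap_usd_bn["BBB"] - market_cap_usd_bn["CCC"]

# Numerical: arithmetic such as an average is meaningful.
average_cap = mean(list(market_cap_usd_bn.values()))

print(products_per_aisle)
print(ordered_by_rank)
print(cap_gap_1_2, cap_gap_2_3, average_cap)
```

The gap between ranks 1 and 2 and the gap between ranks 2 and 3 are both one rank, yet the differences in market capitalization are very different in this made-up data.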
Cross-sectional vs Time-series vs Panel
The distinction between these data types is how the data are collected.
Cross-sectional data represent the observations of a specific variable from multiple observation units at a given time. An example would be the closing price on 31 January of each stock in the S&P 500 Index.
Time-series data represent a sequence of observations of a specific variable for a single observational unit, collected over time and typically at equal discrete intervals. An example would be the daily closing price of Apple shares for a given year.
Panel data represent a mix of cross-sectional and time-series data. They are typically organized in a matrix format. The daily closing prices of all shares in the S&P 500 could be represented as a matrix, with each column representing one company’s shares and each row representing one day’s closing prices.
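The matrix layout described above can be sketched with plain lists. This is a minimal illustration under stated assumptions: the tickers, dates, and prices are invented, and cross_section and time_series are hypothetical helper names.

```python
# Hypothetical closing prices; tickers, dates, and values are invented.
dates = ["2024-01-29", "2024-01-30", "2024-01-31"]
tickers = ["AAA", "BBB", "CCC"]

# Panel: one row per day, one column per company.
panel = [
    [10.0, 20.0, 30.0],
    [10.5, 19.5, 30.0],
    [11.0, 19.0, 31.5],
]


def cross_section(date: str) -> dict[str, float]:
    """One row of the panel: every company's price on a single date."""
    row = panel[dates.index(date)]
    return dict(zip(tickers, row))


def time_series(ticker: str) -> dict[str, float]:
    """One column of the panel: a single company's price on every date."""
    column = tickers.index(ticker)
    return {date: row[column] for date, row in zip(dates, panel)}


print(cross_section("2024-01-31"))
print(time_series("AAA"))
```

Reading across a row gives cross-sectional data, and reading down a column gives time-series data.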
Structured vs Unstructured
The key distinction between structured and unstructured data types is whether the data are in a highly organized form.
Structured data are well-organized in a pre-determined manner. They can be easily entered, stored, and analyzed without manual processing. Most time-series, cross-sectional, and panel data are structured.
Unstructured data do not follow conventional organized forms. The text of a press release and the audio from a conference call are two examples. Unstructured data can be classified into three groups according to their source:
- data generated by individuals, such as social media posts
- data generated by business processes, such as credit card transactions
- data generated by sensors, such as the GPS radio on a mobile phone
Unstructured data may offer new insights. For example, counting the number of cars parked at different store locations may indicate the relative performance of each location. However, the count does not indicate how many people were in each car or how much (if anything) each of them spent. Unstructured data usually must be processed into structured data before they can be used in financial models. For example, the number of cars could be multiplied by the average sale per customer or indexed to the level of sales in a previous period.
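The two processing approaches mentioned for the car counts can be sketched as follows. All figures and names here are hypothetical. The indexing step is one possible interpretation (scaling prior-period sales by the change in the car count), and the first approach assumes one customer per car, which the raw count cannot confirm.

```python
# Hypothetical car counts per store location (for example, from imagery).
cars_counted = {"store_north": 120, "store_south": 80}

# Approach 1: multiply the car count by an assumed average sale per customer.
# Assumes one customer per car; the value 45.0 is invented for illustration.
average_sale_per_customer = 45.0
estimated_sales = {
    store: count * average_sale_per_customer
    for store, count in cars_counted.items()
}

# Approach 2: index to a previous period by scaling that period's sales
# by the change in the car count. Prior-period figures are invented.
prior_cars = {"store_north": 100, "store_south": 100}
prior_sales = {"store_north": 5000.0, "store_south": 4000.0}
indexed_sales = {
    store: prior_sales[store] * cars_counted[store] / prior_cars[store]
    for store in cars_counted
}

print(estimated_sales)
print(indexed_sales)
```

Either result is a structured, numerical estimate per location that a model can use, although both rest on assumptions the car count alone does not verify.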