Etymology and Terminology
– The word ‘data’ is the plural of ‘datum,’ meaning ‘thing given’ in Latin.
– The first English use of the word ‘data’ was in the 1640s.
– The term ‘data processing’ was first used in 1954.
– In everyday language and technical fields, ‘data’ is often used as a mass noun.
– Some style guides distinguish between the plural and mass-noun uses of the term, while others simply recommend whichever form best suits the target audience.
Meaning
– Data, information, knowledge, and wisdom are closely related concepts.
– Data becomes information once it has been analyzed or presented in a relevant context.
– The extent to which data is informative depends on its unexpectedness.
– Knowledge is the awareness of the environment possessed by an entity.
– Data is often considered the least abstract concept, while knowledge is the most abstract.
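The point above about unexpectedness can be made concrete with a small sketch. It assumes Shannon's self-information (surprisal) as the measure, which the text itself does not name, and the function name `surprisal_bits` is purely illustrative. Under that assumption, an outcome with probability p carries log2(1/p) bits, so a certain outcome tells us nothing and a rare one tells us a lot.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Return the self-information, in bits, of an outcome with the given probability.

    Assumption: Shannon's definition, log2(1 / p), is used as the measure
    of "unexpectedness"; the source text does not specify a measure.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return math.log2(1.0 / probability)


# A certain outcome is not informative at all; rarer outcomes carry more bits.
for p in (1.0, 0.5, 1 / 1024):
    print(f"p = {p:.6f} -> {surprisal_bits(p):.1f} bits")
```

A fair coin flip (p = 0.5) carries one bit, while an outcome that was already certain carries none.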
Types of Data
– Data can be discrete or continuous.
– It can describe quantity, quality, facts, statistics, or other basic units of meaning.
– Data can be represented as numbers or characters.
– Field data is collected in an uncontrolled environment, while experimental data is generated in a controlled scientific experiment.
– Data sets include price indices, unemployment rates, literacy rates, and census data.
Data Analysis
– Data is analyzed using techniques such as calculation, reasoning, discussion, presentation, and visualization.
– Raw data is typically cleaned before analysis.
– Outliers are removed, and obvious instrument or data-entry errors are corrected.
– Data analysis can yield insights and intelligence.
– Data science uses machine learning and AI methods for efficient analysis of big data.
Data Collection and Longevity
– Data can be gathered through primary or secondary sources.
– Primary sources involve the researcher obtaining the data firsthand.
– Secondary sources involve the researcher obtaining data that has already been collected by other sources.
– Data analysis methodologies include data triangulation and data percolation.
– The longevity of data is an important concern in computer science, technology, and library science.
– Scientific research generates large amounts of data, but data stored on hard drives or optical discs may become unreadable after a few decades.
– Data accessibility is a problem, as much scientific data is never published or deposited in data repositories.
– Surveys have shown that the likelihood of retrieving data decreases over time after publication.
– The requirement for FAIR data (Findable, Accessible, Interoperable, and Reusable) aims to improve the reproducibility of research.
In common usage, data (US: /ˈdætə/; UK: /ˈdeɪtə/) is a collection of discrete or continuous values that convey information, describing quantities, qualities, facts, statistics, or other basic units of meaning, or simply sequences of symbols that may be further interpreted formally. A datum is an individual value in a collection of data. Data is usually organized into structures such as tables, which provide additional context and meaning and may themselves be used as data in larger structures. Data may be used as variables in a computational process. Data may represent abstract ideas or concrete measurements. Data is commonly used in scientific research, economics, and virtually every other form of human organizational activity. Examples of data sets include price indices (such as the consumer price index), unemployment rates, literacy rates, and census data. In this context, data represents the raw facts and figures from which useful information can be extracted.
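As a rough illustration of the relationship between a datum, a table, and a larger structure, the sketch below builds a tiny table in Python. All names and figures in it (`unemployment`, `labour_statistics`, the regions and rates) are hypothetical values made up for the example, not real statistics.

```python
# A hypothetical table: each row supplies context (region, year) for one value.
# The figures are illustrative only and are not real statistics.
unemployment = [
    {"region": "North", "year": 2022, "rate_percent": 4.1},
    {"region": "North", "year": 2023, "rate_percent": 3.8},
    {"region": "South", "year": 2023, "rate_percent": 5.2},
]

# A datum is a single value; on its own, 3.8 means little without its row.
datum = unemployment[1]["rate_percent"]

# The table can itself be used as data inside a larger structure.
labour_statistics = {
    "unemployment": unemployment,
    "note": "illustrative values only",
}

# The table's structure is what makes a question like "rates in 2023" answerable.
rates_2023 = [row["rate_percent"] for row in unemployment if row["year"] == 2023]
print(datum, rates_2023)
```

The bare number only becomes interpretable through the column names and the row it sits in, which is the sense in which the table adds context and meaning.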
Data is collected using techniques such as measurement, observation, query, or analysis, and is typically represented as numbers or characters that may be further processed. Field data is data that is collected in an uncontrolled, in-situ environment. Experimental data is data that is generated in the course of a controlled scientific experiment. Data is analyzed using techniques such as calculation, reasoning, discussion, presentation, visualization, or other forms of post-analysis. Prior to analysis, raw data (or unprocessed data) is typically cleaned: outliers are removed, and obvious instrument or data entry errors are corrected.
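The cleaning step described above might look like the following minimal sketch. It rests on two assumptions that the text does not make: the "obvious data entry error" is a decimal comma typed instead of a decimal point, and outliers are identified with Tukey's 1.5 × IQR rule. The function name `clean_readings` and the sample readings are hypothetical.

```python
import statistics


def clean_readings(raw_values):
    """Correct obvious entry errors, then remove outliers.

    Assumptions (not from the source text):
    - the entry error to correct is a decimal comma, e.g. "21,9" for "21.9";
    - outliers are values outside Tukey's fences (1.5 * IQR beyond the quartiles).
    """
    # Step 1: correct entry errors (stray whitespace, decimal commas).
    corrected = [float(value.strip().replace(",", ".")) for value in raw_values]

    # Step 2: compute the quartiles and the outlier fences.
    q1, _median, q3 = statistics.quantiles(corrected, n=4)
    spread = 1.5 * (q3 - q1)
    lower, upper = q1 - spread, q3 + spread

    # Step 3: keep only the values inside the fences, preserving order.
    return [value for value in corrected if lower <= value <= upper]


# Hypothetical raw temperature readings, recorded as text.
raw = ["21.4", "21,9", " 22.1", "21.7", "98.6", "22.0", "21.8"]
cleaned = clean_readings(raw)
print(cleaned)
```

In this sample, the comma in "21,9" is corrected rather than discarded, while the reading of 98.6 falls far outside the fences and is dropped. Other outlier rules, or domain-specific plausibility limits, could be substituted in step 2.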
Data can be seen as the smallest units of factual information that can be used as a basis for calculation, reasoning, or discussion. Data can range from abstract ideas to concrete measurements, including, but not limited to, statistics. Thematically connected data presented in some relevant context can be viewed as information. Contextually connected pieces of information can then be described as data insights or intelligence. The stock of insights and intelligence that accumulates over time from the synthesis of data into information can then be described as knowledge. Data has been described as "the new oil of the digital economy". Data, as a general concept, refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing.
Advances in computing technologies have led to the advent of big data, which usually refers to very large quantities of data, typically at the petabyte scale. Working with such large (and growing) datasets using traditional data analysis methods and computing is difficult, and sometimes impossible. (Theoretically speaking, infinite data would yield infinite information, which would render extracting insights or intelligence impossible.) In response, the relatively new field of data science uses machine learning and other artificial intelligence (AI) methods that allow analytic methods to be applied efficiently to big data.