This glossary of terms is designed to help leaders stay current with the ways that data is talked about and used in both the public and private sectors.
This list provides definitions of key terms. If you’re interested in additional information, like use cases and examples, or want a copy to print out or share, you can download the full version here or by clicking on the link below.
If you have suggestions for terms we should add or definitions we should improve, feel free to email us.
The glossary is published with a Creative Commons CC BY-NC 4.0 licence. This means that it can be reused for non-commercial purposes, as long as Mastodon C are credited. Licence details can be found here.
A process or set of rules to carry out a particular task, for example data analysis algorithms. Often expressed in computer code.
The discovery, interpretation and communication of meaningful patterns and insights in data.
The process of removing detail from or otherwise transforming data, to avoid any identification of individuals or organisations.
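As a rough illustration of removing and transforming detail, the sketch below drops a name field and replaces an exact age with a ten-year band. The records, field names and the `anonymise` helper are invented for this example, and steps like these do not by themselves guarantee that nobody can be identified.

```python
def anonymise(records):
    """Drop direct identifiers and coarsen age into ten-year bands."""
    result = []
    for rec in records:
        band_start = (rec["age"] // 10) * 10
        result.append({
            # Keep only the generalised age and the region; the name is discarded.
            "age_band": f"{band_start}-{band_start + 9}",
            "region": rec["region"],
        })
    return result


# Invented example records (not real people).
people = [
    {"name": "Alice Example", "age": 34, "region": "North"},
    {"name": "Bob Example", "age": 58, "region": "South"},
]
anonymised = anonymise(people)
```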
“Intelligent behaviour” exhibited by machines, for example learning and problem solving.
Any form of data that due to its size, velocity (rate of change) or complexity pushes the limits of current storage and analytical capability.
The task of preparing data so that it can be used for a specific purpose, whether that’s analysis or sharing with others.
A general purpose programming language used to work on data projects.
Storing data on machines accessed remotely over an internet connection, as opposed to on a machine or server housed in your own building.
A CSV file is a Comma Separated Values file which allows data to be saved in a table structured format. CSVs look similar to a normal spreadsheet, but are reliably usable in more contexts.
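To make the table structure concrete, here is a minimal sketch that reads a small piece of CSV text using Python's standard `csv` module. The column names and values are invented for illustration.

```python
import csv
import io

# A tiny CSV document: the first line holds column names, later lines hold rows.
csv_text = "item,quantity\napples,3\npears,5\n"

# DictReader turns each row into a mapping from column name to value.
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

Each value is read as text, so `rows[1]["quantity"]` is the string `"5"` until it is converted to a number.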
“A set of values of qualitative or quantitative variables.” (Wikipedia). Information in raw form (such as letters, numbers, or symbols) that refers to, or represents, conditions, ideas, or objects. In the context of computing, data can be thought of as information that is transmitted or stored.
A digital collection of data and the structure around which the data is organized.
Data science is an interdisciplinary exercise that aims to find useful answers and insights in data by combining mathematical, scientifically robust approaches with computer programming techniques.
Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
Deep Learning involves feeding a computer system a lot of data, which it can use to make decisions (Forbes). A subset of machine learning in Artificial Intelligence (AI), Deep Learning develops networks which are capable of learning unsupervised from data that is unstructured or unlabeled. Also known as Deep Neural Learning or Deep Neural Network (Investopedia).
A process used in databases and data warehousing: extracting data from outside sources, transforming it to fit operational needs, and loading it into a database.
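The three steps can be sketched in a few lines of Python, assuming a small CSV text as the outside source and an in-memory SQLite database as the destination. The table name, column names and values are all invented for this example.

```python
import csv
import io
import sqlite3

# Hypothetical outside source: amounts recorded in pence.
SOURCE = "name,amount_pence\nwidget,1250\ngadget,399\n"


def extract(text):
    """Extract: read rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy the names and convert pence to pounds."""
    return [(r["name"].title(), int(r["amount_pence"]) / 100) for r in rows]


def load(rows, conn):
    """Load: write the transformed rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount_pounds REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), conn)
loaded = conn.execute("SELECT name, amount_pounds FROM sales ORDER BY name").fetchall()
```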
How data is structured and stored.
A geographic information system (GIS) is a system designed to capture, store, manipulate, analyze, manage, and present spatial or geographic data.
The processes, policies and tools that ensure data is formally managed, so that an organisation meets its policy, legal and statutory requirements, and so that data can serve the mission and goals of the organisation.
An open-source software framework used for storage and processing of (typically) large datasets.
The connection of ordinary, everyday physical objects and products to the internet, so that they can relate to other systems or their data can be used and analysed.
A method of publishing structured data so that it can be interlinked and become more useful.
A data licence tells someone what they can and can’t legally do with a piece of data or software.
A set of data that describes and gives information about other data.
A subfield of artificial intelligence that gives computers the ability to learn without being explicitly programmed.
An abstract construct that organizes elements of data and standardizes how they relate to one another and to properties of real world entities.
Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with using computers to make sense of human language.
Data that anyone can access, use and share, for any purpose, without cost.
Software for which the original source code or information is made freely available and may be redistributed and modified in various ways depending on its licence.
Using data to predict what will happen next - for example what someone is likely to buy or visit, or how something will behave.
A place where data is published for use by others.
Python is a widely used high-level programming language for general-purpose programming.
R is an open source programming language and software environment for statistical computing.
A data format, RDF is a framework for describing resources on the web. RDF is designed to be read and understood by computers.
Using data and algorithms against unstructured text, to provide insights into what people are thinking and feeling.
A software tool that you access from your browser rather than one that is downloaded and installed onto your device.
Apache Spark is a “fast and general engine for large-scale data processing” (Apache). It was built for speed, ease of use, and analytics.
Data that is identifiable and easy to use, as it is pre-organized in a structure such as rows and columns.
A very common kind of data which describes events such as payments, events in a system, or appointments, often held in a large database or data warehouse.
Unstructured data is data that has no predefined structure. It is generally text-heavy, but may also contain dates, numbers and facts.
Representing data, or relationships between data, in a visual manner so as to communicate a finding, relationship or story.
The speed at which the data is created, stored, analysed and visualized.
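XML stands for Extensible Markup Language. It is a text format for marking up documents and data with tags so that they can be read by both people and computers.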
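As a small illustration, the sketch below reads a one-line XML document with Python's standard `xml.etree.ElementTree` module. The `glossary` and `term` tag names are invented for this example.

```python
import xml.etree.ElementTree as ET

# A tiny XML document: a <glossary> element containing one <term> element.
xml_text = "<glossary><term name='CSV'>Comma Separated Values</term></glossary>"

root = ET.fromstring(xml_text)   # parse the text into a tree of elements
term = root.find("term")         # look up the first <term> child
```

Here `term.get("name")` gives the value of the `name` attribute and `term.text` gives the text between the opening and closing tags.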