Mastodon C

Big Data Done Better

Glossary

a leader’s A-Z guide to common data terms

This glossary of terms is designed to help leaders stay current with the ways that data is talked about and used in both public and private sectors.

This list provides definitions of key terms. If you’re interested in additional information, like use cases and examples, or want a copy to print out or share, you can download the full version here or by clicking on the link below.

If you have suggestions for terms we should add or definitions we could improve, feel free to email us.

The glossary is published under a Creative Commons CC BY-NC 4.0 licence. This means that it can be reused for non-commercial purposes, as long as Mastodon C are credited. Licence details can be found here.

Download a free A-Z of data terms


    Algorithm

    A process or set of rules to carry out a particular task, for example data analysis algorithms. Often expressed in computer code.

    Analytics

    The discovery, interpretation and communication of meaningful patterns and insights in data.

    Anonymisation

    The process of removing detail from or otherwise transforming data, to avoid any identification of individuals or organisations.
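
    As a rough illustration of "removing detail or otherwise transforming data", the sketch below drops a name, replaces an exact age with a ten-year band, and keeps only the first part of a postcode. The field names and the record are invented for this example, and real anonymisation also needs an assessment of re-identification risk, which this sketch does not attempt.

```python
def anonymise(record):
    """Return a copy of a record with identifying detail removed or coarsened.

    The field names ("name", "age", "postcode", "visits") are hypothetical.
    """
    low = (record["age"] // 10) * 10
    return {
        # The name is dropped entirely: it identifies the person directly.
        # The exact age is replaced by a ten-year band.
        "age_band": f"{low}-{low + 9}",
        # Only the first part of the postcode (the wider area) is kept.
        "postcode_area": record["postcode"].split()[0],
        # Non-identifying values are passed through unchanged.
        "visits": record["visits"],
    }


record = {"name": "Jane Smith", "age": 34, "postcode": "SW1A 1AA", "visits": 3}
print(anonymise(record))
```

    The result still supports analysis (for example, visits by age band) while carrying less detail about the individual.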

    Artificial Intelligence (AI)

    “Intelligent behaviour” exhibited by machines, for example learning and problem solving.

    Big Data

    Any form of data that due to its size, velocity (rate of change) or complexity pushes the limits of current storage and analytical capability.

    Cleaning

    The task of preparing data so that it can be used for a specific purpose, whether that’s analysis or sharing with others.

    Clojure

    A general purpose programming language used to work on data projects.

    Cloud storage

    Storing data on machines accessed remotely over an internet connection, as opposed to on a machine or server housed in your own building.

    CSV

    A CSV (Comma Separated Values) file stores a table of data as plain text: one row per line, with commas separating the values in each row. CSVs look similar to a normal spreadsheet when opened, but can be read reliably by many more tools.
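
    To show what this looks like in practice, here is a small sketch that reads CSV text with Python's built-in csv module. The column names and values are made up for illustration.

```python
import csv
import io

# A tiny CSV "file" held in memory: a header row, then one line per record.
text = "team,headcount\nFinance,12\nOperations,30\n"

# DictReader uses the header row to label the values in each following row.
rows = list(csv.DictReader(io.StringIO(text)))

for row in rows:
    print(row["team"], row["headcount"])
```

    Note that every value is read as text, so numbers such as the headcount need converting before any arithmetic.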

    Data

    “A set of values of qualitative or quantitative variables.” (Wikipedia). Information in raw form (such as letters, numbers, or symbols) that refers to, or represents, conditions, ideas, or objects. In the context of computing, data can be thought of as information that is transmitted or stored.

    Database

    A digital collection of data and the structure around which the data is organized.

    Data Science

    Data science is an interdisciplinary exercise that aims to find useful answers and insights in data by combining mathematical, scientifically robust approaches with computer programming techniques.

    Data Mining

    Data mining is the computing process of discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems.

    Deep Learning

    Deep Learning involves feeding a computer system a lot of data, which it can use to make decisions (Forbes). A subset of machine learning in Artificial Intelligence (AI), Deep Learning builds layered neural networks that can learn from unstructured or unlabelled data, in some cases without supervision. Also known as Deep Neural Learning or Deep Neural Networks (Investopedia).

    ETL (Extract, Transform and Load)

    A process used in databases and data warehousing: extracting data from outside sources, transforming it to fit operational needs, and loading it into a database.
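
    The three steps can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a production pipeline: the source data, the "payments" table and the function names are all invented for the example, and the database lives only in memory.

```python
import csv
import io
import sqlite3

# Hypothetical source data with untidy names, standing in for an outside source.
SOURCE = "customer,amount\n alice ,10.50\nBOB,4.25\n"


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy the names and turn the amounts into numbers."""
    return [(r["customer"].strip().title(), float(r["amount"])) for r in rows]


def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), conn)

result = conn.execute(
    "SELECT customer, amount FROM payments ORDER BY customer"
).fetchall()
print(result)
```

    Keeping the three steps as separate functions makes it easier to change one (for example, a new source) without touching the others.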

    Formats

    How data is structured and stored.

    GIS

    A geographic information system (GIS) is a system designed to capture, store, manipulate, analyze, manage, and present spatial or geographic data.

    Governance

    The processes, policies and tools that ensure data is formally managed, so that an organisation meets its policy, legal and statutory requirements, and so that data can serve the organisation's mission and goals.

    Hadoop

    An open-source software framework used for storage and processing of (typically) large datasets.

    IoT - Internet of Things

    The connection of ordinary, everyday physical objects and products to the internet, so that they can interact with other systems and so that their data can be used and analysed.

    Linked Data

    A method of publishing structured data so that it can be interlinked and become more useful.

    Licensing

    A licence tells someone what they can and can’t legally do with a piece of data or software.

    Metadata

    Data that describes and gives information about other data, for example who created a file, when, and in what format.

    Machine Learning

    A subfield of artificial intelligence that gives computers the ability to learn without being explicitly programmed.

    Model

    An abstract construct that organizes elements of data and standardizes how they relate to one another and to properties of real world entities.

    Natural Language Processing

    Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with using computers to make sense of human language.

    Open Data

    Data that anyone can access, use and share, for any purpose, without cost.

    Open Source

    Software for which the original source code or information is made freely available and may be redistributed and modified in various ways depending on its licence.

    Predictive Analytics

    Using data to predict what will happen next - for example what someone is likely to buy or visit, or how something will behave.

    Data Platform

    A place where data is published for use by others.

    Python

    Python is a widely used, high-level, general-purpose programming language.

    R

    R is an open source programming language and software environment for statistical computing.

    RDF

    RDF (Resource Description Framework) is a framework for describing resources on the web. It is designed to be read and understood by computers.

    Sentiment Analysis

    Applying algorithms to unstructured text, such as reviews or social media posts, to provide insights into what people are thinking and feeling.
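
    A very simple way to picture this is a word-list approach: count the positive words, subtract the negative ones. The sketch below is a toy under that assumption. The word lists and the function name are invented for the example, and real tools typically use much larger vocabularies or trained models.

```python
import re

# Tiny, hand-picked word lists (hypothetical, for illustration only).
POSITIVE = {"good", "great", "helpful", "love"}
NEGATIVE = {"bad", "slow", "poor", "hate"}


def sentiment_score(text):
    """Return positive-word count minus negative-word count for a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


print(sentiment_score("The staff were helpful and the service was great"))
print(sentiment_score("The website is slow and the support was poor"))
```

    A score above zero suggests a positive comment and a score below zero a negative one. A toy like this misses sarcasm, negation ("not good") and context.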

    Software-as-a-Service (SaaS)

    A software tool that you access from your browser rather than one that is downloaded and installed onto your device.

    Spark

    Apache Spark is a “fast and general engine for large-scale data processing” (Apache). It was built for speed, ease of use, and analytics.

    Structured data

    Data that is identifiable and easy to use, because it is pre-organised in a structure such as rows and columns.

    Transactional data

    A very common kind of data which describes events such as payments, events in a system, or appointments, often held in a large database or data warehouse.

    Unstructured data

    Data that has no pre-defined structure, such as emails, documents or social media posts. It is generally text-heavy, but may also contain dates, numbers and facts.

    Visualization

    Representing data, or relationships between data, in a visual manner so as to communicate a finding, relationship or story.

    Velocity

    The speed at which the data is created, stored, analysed and visualized.

    XML

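    XML stands for Extensible Markup Language. It is a text format that uses named tags to describe documents and data, so that they can be read by both people and computers.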
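
    As an illustration of what XML looks like and how a program might read it, the sketch below parses a small made-up document with Python's built-in xml.etree.ElementTree module. The tag names, attribute and titles are hypothetical.

```python
import xml.etree.ElementTree as ET

# A small, invented XML document: named tags wrap each piece of information.
document = """
<library>
  <book year="2015"><title>Example One</title></book>
  <book year="2018"><title>Example Two</title></book>
</library>
"""

root = ET.fromstring(document)

# Collect the text inside each <title> tag and the "year" attribute of each <book>.
titles = [book.find("title").text for book in root.findall("book")]
years = [book.get("year") for book in root.findall("book")]

print(titles)
print(years)
```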
