Domain:
Digitalisation; Infocomm Technology & Smart Systems
Remark:
You should have attended CSC's Data Analytics - Basic Principles and Applications (CRDDA10/CRDDAVL) programme, or be familiar with basic data analytics. After registration, you are required to complete a pre-programme survey form so that we can assess your suitability for the course.
Who Should Attend:
You are a Singapore Public Officer who has some basic understanding of data analytics and statistics, and you wish to prepare and process data effectively.
Programme Overview
Did you know that people often spend 80% of their time cleaning data and only about 20% doing the analysis?
Good, clean data is not always readily available. The datasets we first encounter are often large, stored in formats that are not easy to use, or contain inaccurate, incomplete or unreasonable values. Sometimes there is also a need to combine data from different sources, which can be a tricky and messy job.
In this programme*, you will use a range of hands-on activities and case studies to learn how to reduce the effort spent on cleaning data, so that you can spend more of your time analysing it and improve your productivity.
*This programme is suitable for officers who need to clean data at work.
Learning Outcomes
- Recognise the importance of structuring data in a tidy data format and apply the steps to convert ‘messy’ data to ‘tidy’ data for data analysis
- Describe the dimensions of data quality – validity, accuracy, completeness, consistency, uniformity – and demonstrate the ability to explore the data quality before starting data analysis
- Describe the data cleaning workflow of inspecting the data, cleaning the data and verifying the cleaning results
- Give examples of tools to use for data cleaning and preparation
- Demonstrate the ability to inspect a dataset to check the validity and profile the data using statistical summaries
- Perform the steps of data cleaning on a dataset using add-ins/tools available in Microsoft Excel
Principles of tidy data
- Difference between tidy data and structured data
- Importance of tidy data for data analysis
- Converting ‘messy’ data to tidy data
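As an illustration of converting ‘messy’ data to tidy data, the sketch below reshapes a small "wide" table (one column per year) into a tidy layout in which each row holds one observation. The programme itself works in Microsoft Excel; Python is used here only to make the idea concrete, and the table, the column names and the `to_tidy` helper are all made up for the example.

```python
# Hypothetical "messy" table: one row per agency, one column per year.
messy = [
    {"agency": "A", "2021": 120, "2022": 135},
    {"agency": "B", "2021": 80, "2022": 95},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy_rows.append(
                {
                    id_column: row[id_column],
                    variable_name: column,
                    value_name: value,
                }
            )
    return tidy_rows


tidy = to_tidy(messy, "agency", "year", "cases")
for record in tidy:
    print(record)
```

In the tidy version, "year" becomes a variable in its own column instead of being spread across column headers, which is usually easier to filter, group and chart.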
Understanding data quality
- Validity – does the data conform to business rules and constraints?
- Accuracy – does the data conform to standard or true values?
- Completeness – to what extent is the data known / complete?
- Consistency – is the data consistent within the same dataset or across multiple datasets?
- Uniformity – is the data specified using the same units of measure?
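Two of these dimensions are easy to show in a few lines. The sketch below measures completeness as the share of non-blank values in a field, and restores uniformity by converting heights recorded in mixed units to centimetres. The records, field names and helper functions are hypothetical and are not part of the programme material.

```python
# Hypothetical records with a blank email and heights in mixed units.
records = [
    {"id": 1, "height": "172 cm", "email": "a@example.com"},
    {"id": 2, "height": "1.5 m", "email": ""},
    {"id": 3, "height": "180 cm", "email": "c@example.com"},
]


def completeness(rows, field):
    """Share of rows in which the field is present and not blank."""
    filled = sum(1 for row in rows if str(row.get(field, "")).strip())
    return filled / len(rows)


def height_in_cm(text):
    """Convert a value such as '172 cm' or '1.5 m' to centimetres."""
    value, unit = text.split()
    factor = {"cm": 1.0, "m": 100.0}[unit]  # assumed set of units
    return float(value) * factor


email_completeness = completeness(records, "email")
heights_cm = [height_in_cm(row["height"]) for row in records]

print(f"email completeness: {email_completeness:.0%}")
print("heights in cm:", heights_cm)
```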
Iterative workflow / process of data cleaning
- Inspect the data to detect unexpected, incorrect or inconsistent data
- Clean the data to remove or fix the anomalies
- Verify the results for correctness
- Record and/or report the changes made
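The four steps above can be sketched as a small routine that inspects a column, fixes what it finds, verifies the result and keeps a record of every change. This is only a minimal illustration: the single rule used here (stray leading or trailing spaces) and the function names are assumptions chosen for the example.

```python
def inspect(values):
    """Inspect: return the positions of values with stray spaces."""
    return [i for i, value in enumerate(values) if value != value.strip()]


def clean_with_log(values):
    """Clean the flagged values, verify the result and log each change."""
    cleaned = list(values)  # work on a copy; keep the raw data untouched
    change_log = []
    for position in inspect(cleaned):
        fixed = cleaned[position].strip()
        change_log.append(
            {"position": position, "before": cleaned[position], "after": fixed}
        )
        cleaned[position] = fixed
    # Verify: a second inspection should find nothing left to fix.
    if inspect(cleaned):
        raise ValueError("cleaning did not resolve every flagged value")
    return cleaned, change_log


regions = [" north", "south ", "east"]
cleaned_regions, change_log = clean_with_log(regions)
print(cleaned_regions)
for entry in change_log:
    print(entry)
```

Keeping the change log alongside the cleaned data makes it possible to report what was altered and to repeat the cycle if a later inspection reveals new problems.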
Inspecting data (hands-on)
- Analysing the validity of data (e.g. data type, range constraints, mandatory and unique constraints, set-membership constraints, regular expression patterns)
- Data profiling using summary statistics
- Data visualisation to find outliers
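To give a feel for these checks outside a spreadsheet, the sketch below applies each kind of validity constraint listed above to a few rows and then profiles a numeric column with summary statistics. The rows, the allowed status values, the 0 to 120 age range and the six-digit postal code pattern are all assumptions made for the example.

```python
import re
import statistics

# Hypothetical rows; every value arrives as text, as in a raw export.
rows = [
    {"id": "001", "age": "34", "status": "active", "postal": "123456"},
    {"id": "002", "age": "-5", "status": "actve", "postal": "12345"},
    {"id": "002", "age": "41", "status": "inactive", "postal": "654321"},
]

ALLOWED_STATUS = {"active", "inactive"}  # set-membership constraint
POSTAL_PATTERN = re.compile(r"\d{6}")    # assumed format: six digits


def find_problems(rows):
    """Return (row index, field, issue) for every failed validity check."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        if not row["id"]:
            problems.append((index, "id", "missing"))        # mandatory
        elif row["id"] in seen_ids:
            problems.append((index, "id", "duplicate"))      # unique
        seen_ids.add(row["id"])
        try:
            age = int(row["age"])                             # data type
        except ValueError:
            problems.append((index, "age", "not an integer"))
        else:
            if not 0 <= age <= 120:                           # range
                problems.append((index, "age", "out of range"))
        if row["status"] not in ALLOWED_STATUS:
            problems.append((index, "status", "not in allowed set"))
        if not POSTAL_PATTERN.fullmatch(row["postal"]):
            problems.append((index, "postal", "pattern mismatch"))
    return problems


def profile(values):
    """Summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


problems = find_problems(rows)
for problem in problems:
    print(problem)
print(profile([30, 40, 50]))
```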
Cleaning data (hands-on)
- Remove irrelevant data
- Remove duplicates
- Transform data types
- Handle syntax errors, especially with text data (e.g. remove whitespace, fix typos)
- Standardise data formats
- Transform data by scaling and normalisation
- Deal with missing values
- Deal with outliers
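Several of these cleaning steps can be sketched together on a small example: trimming whitespace, standardising text case, converting text to numbers, removing duplicates, filling a missing value and scaling a numeric column. The data and helper names are hypothetical. Filling with the median and min-max scaling are only one reasonable choice each, not necessarily the methods taught in the programme.

```python
import statistics

# Hypothetical raw rows with stray spaces, mixed case and a blank score.
raw = [
    {"name": "  Alice Tan ", "dept": "finance", "score": "80"},
    {"name": "Alice Tan", "dept": "Finance", "score": "80"},
    {"name": "Bob Lim", "dept": "HR ", "score": ""},
    {"name": "Chen Wei", "dept": "hr", "score": "60"},
]


def clean(rows):
    """Standardise text, convert types, drop duplicates, fill blanks."""
    cleaned, seen = [], set()
    for row in rows:
        name = " ".join(row["name"].split())   # trim and collapse spaces
        dept = row["dept"].strip().upper()     # one format for departments
        score = float(row["score"]) if row["score"].strip() else None
        key = (name, dept, score)
        if key in seen:
            continue                           # exact duplicate after cleaning
        seen.add(key)
        cleaned.append({"name": name, "dept": dept, "score": score})
    known = [r["score"] for r in cleaned if r["score"] is not None]
    fill_value = statistics.median(known)      # assumed imputation rule
    for r in cleaned:
        if r["score"] is None:
            r["score"] = fill_value
    return cleaned


def min_max_scale(values):
    """Rescale numbers to the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


cleaned = clean(raw)
scaled_scores = min_max_scale([r["score"] for r in cleaned])
for record, scaled in zip(cleaned, scaled_scores):
    print(record, scaled)
```

Note that the two "Alice Tan" rows only become recognisable as duplicates after the text has been standardised, which is why the order of the cleaning steps matters.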