Effective planning to use data for outstanding output
Identifying, collecting, analysing and reporting data are crucial
Dr. Annapoorna Ravichander & Mrinalini Kabbur
Think Tanks, as research organisations, conduct robust research to provide evidence-based, actionable recommendations to policymakers and governments. Effective data handling — identifying, collecting, analysing and reporting — is crucial.
Think Tanks and research organisations should follow a systematic approach to ensure a well-defined project and smooth workflow. The following steps were used to work on a project on malnutrition in India.
Step 1: Problem definition
The problem statement stage is the crucial first step in solving an analytics problem. When clients present their issues, they often describe them in layman’s terms, which are not immediately actionable from an analytics perspective. Therefore, the problem needs to be reframed to be SMART: Specific, Measurable, Achievable, Relevant and Time-bound. Understanding the client’s pain points and conducting a literature survey are essential tools for accurately framing the problem.
For example, in the malnutrition project, the problem was framed as identifying and understanding the patterns of child malnutrition and their influencing factors across the States and districts of India.
Step 2: Conceptual and data framework
A conceptual framework is an analytical tool to understand a phenomenon comprehensively. It defines the study’s variables or concepts and maps out their potential relationships, typically based on a literature review of existing studies and theories. This framework is often presented visually or in writing. Constructs, which are abstract and not directly measurable, are transformed into variables through measurable indicators. This is known as an Indicator or Data Framework.
For example, the conceptual framework was adapted from the UNICEF framework for malnutrition to suit local requirements. Malnutrition is defined through child nutrition and child anaemia. The immediate factors leading to malnutrition include child diet and child disease. Underlying factors consist of household profile, mother’s education, and empowerment. Child nutrition is measured using indicators such as stunting, wasting, and underweight in children aged 0-5 years. Once the indicators for the whole conceptual framework were identified, data was collected for each.
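In code, an indicator (data) framework can be as simple as a mapping from abstract constructs to measurable indicators. The sketch below mirrors the adapted framework described above; the indicator names are hypothetical placeholders, not actual NFHS-5 fields.

```python
# Illustrative mapping of constructs to measurable indicators.
# All indicator names are hypothetical, for illustration only.
data_framework = {
    "child_nutrition": ["stunting_pct", "wasting_pct", "underweight_pct"],  # children aged 0-5
    "child_anaemia": ["anaemia_pct"],
    "immediate_factors": ["child_diet_adequacy", "child_disease_incidence"],
    "underlying_factors": ["household_wealth_index", "mother_education_years",
                           "mother_empowerment_score"],
}

for construct, indicators in data_framework.items():
    print(construct, "->", indicators)
```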
Step 3: Data collection
This step involves accessing and ingesting data, starting with identifying the required data, guided by the indicator/data framework. Data can come from various sources in structured, semi-structured, and unstructured formats, such as CSV, Excel, XML, flat files, JSON and relational databases. Key data sources in governance include:
- Open data initiatives (e.g., mydata.gov.in)
- Internal government data
- Government and private surveys
- Satellite imagery
- Web-extracted data
After identifying the data, it must be imported into analysis tools, consolidating structured, semi-structured, and unstructured data into a common repository for analysis.
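As a minimal sketch of this consolidation step, the snippet below reads differently formatted sources with pandas and lands them in a single SQLite repository. All file and table names are illustrative placeholders.

```python
import sqlite3

import pandas as pd

# Read structured and semi-structured sources; file names are placeholders.
survey = pd.read_csv("nfhs5_survey.csv")              # CSV
admin = pd.read_excel("district_admin.xlsx")          # Excel (needs openpyxl)
facilities = pd.read_json("health_facilities.json")   # JSON

# Consolidate everything into a common repository (here, one SQLite database).
with sqlite3.connect("common_repository.db") as conn:
    survey.to_sql("survey", conn, if_exists="replace", index=False)
    admin.to_sql("admin", conn, if_exists="replace", index=False)
    facilities.to_sql("facilities", conn, if_exists="replace", index=False)
```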
For example, the National Family Health Survey 5 (NFHS-5) [2019-21] was used for the analysis. The malnutrition analysis was performed for 707 districts, 28 States, and 8 Union Territories.
Step 4: Data preparation
Data preparation is the process of cleaning and transforming raw data before analysis.
Data cleaning addresses issues such as missing data, duplicate data, irrelevant data, outliers, and structural errors (for instance, “N/A” and “Not Applicable” appearing in the same column even though they mean the same thing).
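A hedged illustration of these cleaning operations in pandas, with hypothetical column names standing in for real survey fields:

```python
import pandas as pd

df = pd.read_csv("nfhs5_survey.csv")  # placeholder file; columns are hypothetical

# Structural errors: unify labels that mean the same thing.
df["employment"] = df["employment"].replace({"Not Applicable": "N/A"})

# Missing data: drop rows missing the outcome; impute one predictor with its median.
df = df.dropna(subset=["stunted"])
df["mother_education_years"] = df["mother_education_years"].fillna(
    df["mother_education_years"].median()
)

# Duplicate and irrelevant data.
df = df.drop_duplicates()
df = df.drop(columns=["interviewer_id"])  # not needed for the analysis

# Outliers: a simple 3-standard-deviation screen on household size.
z = (df["household_size"] - df["household_size"].mean()) / df["household_size"].std()
df = df[z.abs() <= 3]
```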
Data transformation primarily involves the following (a short code sketch at the end of this step illustrates them):
Data Scaling: Necessary when variables are on different scales. For instance, housing data may have house sizes in thousands of square feet and prices in lakhs or crores. Scaling methods include standardisation (Z-score) and normalisation (min-max).
Creation of Dummy Variables: Analytical techniques work best with numerical data. For categorical data, dummy variables are created. These are numeric variables representing categories (e.g., gender, race) with values of 0 or 1. To represent a categorical variable with k values, k−1 dummy variables are defined.
Merging, Splitting, and Joining: Data often needs to be divided into subsets or combined. For example, merging datasets containing student demographics with academic performance helps analyse performance by age and gender. Additionally, data can be split into training and testing sets for model validation.
For example, the NFHS-5 data was collected at the individual level. District and State-level aggregates were computed using the Demographic and Health Surveys (DHS) Guide to DHS Statistics as the reference document.
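The sketch below illustrates these transformations, plus a simple district-level aggregation in the spirit of the DHS guidance, using pandas and scikit-learn. All file names, column names, and join keys are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cleaned_survey.csv")  # placeholder; columns are hypothetical

# Scaling: standardisation (Z-score) and normalisation (min-max).
x = df["household_income"]
df["income_z"] = (x - x.mean()) / x.std()
df["income_01"] = (x - x.min()) / (x.max() - x.min())

# Dummy variables: drop_first=True yields k-1 dummies for k categories.
df = pd.get_dummies(df, columns=["gender"], drop_first=True)

# Merging: join individual records with district metadata on a shared key.
districts = pd.read_csv("district_metadata.csv")  # placeholder
merged = df.merge(districts, on="district_id", how="left")

# Splitting: hold out 20% of rows for model validation.
train, test = train_test_split(merged, test_size=0.2, random_state=42)

# Aggregation: individual-level records rolled up to district-level means.
district_agg = merged.groupby("district_id")[["stunted", "wasted", "underweight"]].mean()
```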
Step 5: Exploratory Data Analysis
Descriptive Statistics: Summarises the dataset with univariate analysis. The five types of measures are:
Measures of Frequency: Count, percentage, etc.
Measures of Central Tendency: Mean, median, mode.
Measures of Dispersion: Range, variance, standard deviation.
Measures of Position: Quartiles, percentiles.
Measures of Shape: Skewness, kurtosis.
Visualisation Techniques: Represents data visually, making it easier to understand and remember. Includes:
Univariate Visualisations: Pie chart, histogram, boxplot.
Bivariate/Multivariate Visualisations: Scatter plot, bar plot, heat map.
Correlation Analysis: A bivariate statistical technique showing the strength and direction of relationships between pairs of variables.
Inferential Statistics: Uses sample data to draw inferences about the population. Hypothesis tests, such as t-tests, Chi-square tests, and ANOVA, answer questions about effectiveness and relationships, such as the impact of a new drug or the benefit of a professional degree in data science; these techniques are sketched in code below.
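A compact sketch of the EDA techniques above, using pandas, Matplotlib, and SciPy. The dataset, column names, and the north/south grouping are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

df = pd.read_csv("district_aggregates.csv")  # placeholder; columns are hypothetical

# Descriptive statistics: central tendency, dispersion, position, shape.
print(df["stunting_pct"].describe())          # count, mean, std, quartiles
print("skewness:", df["stunting_pct"].skew())
print("kurtosis:", df["stunting_pct"].kurt())

# Visualisation: one univariate and one bivariate view.
df["stunting_pct"].hist(bins=20)
df.plot.scatter(x="mother_education_years", y="stunting_pct")
plt.show()

# Correlation analysis across pairs of indicators.
print(df[["stunting_pct", "anaemia_pct", "mother_education_years"]].corr())

# Inferential statistics: a two-sample t-test between two groups of districts.
north = df.loc[df["region"] == "north", "stunting_pct"]
south = df.loc[df["region"] == "south", "stunting_pct"]
t_stat, p_value = stats.ttest_ind(north, south, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```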
For example, cluster analysis was conducted to identify malnutrition patterns in anthropometry and anaemia levels across the States. The Distance to Frontier method was used to identify poorly performing districts and States, and correlation and regression techniques were then applied to determine the influencing factors.
Step 6: Models and Algorithms
Data science models are mathematical algorithms or statistical methods used to analyse data and make predictions or decisions. Machine learning models, widely used today, develop algorithms by learning hidden patterns in historical data to make predictions on new data.
Types of Machine Learning Algorithms:
Supervised Learning: Machines are trained on well-labelled data and learn to predict outputs for new inputs. Examples include:
Regression Models: Linear regression.
Classification Models: Decision tree, random forest, logistic regression.
Unsupervised Learning: The data is not labelled, and algorithms find hidden patterns within the dataset. Example:
Cluster Analysis: A popular unsupervised learning algorithm.
Other Models:
Time Series Models: Used to model time series data for understanding patterns and forecasting future values. Example: Autoregressive integrated moving average (ARIMA).
Dimensionality Reduction Models: Reduce data dimensionality while preserving variance. Example: Principal Component Analysis (PCA).
These models are particularly popular in the public policy domain.
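As an illustration of the unsupervised and dimensionality reduction models above, the sketch below clusters hypothetical district-level indicators with k-means and projects them with PCA using scikit-learn; the dataset and column names are assumptions, not the project's actual data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("district_aggregates.csv")  # placeholder; columns are hypothetical
indicators = ["stunting_pct", "wasting_pct", "underweight_pct", "anaemia_pct"]
X = StandardScaler().fit_transform(df[indicators])

# Unsupervised learning: group districts with similar malnutrition profiles.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X)
print(df.groupby("cluster")[indicators].mean())  # profile of each cluster

# Dimensionality reduction: project the indicators onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_)
```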
Step 7: Reporting
Data dashboards are widely used for reporting data analytics results. They can be created using tools such as Shiny in R or Plotly in Python. Business Intelligence tools such as Power BI and Tableau are also popular for dashboard development. Results can also be communicated through detailed reports.
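A minimal example of the Plotly route, producing one self-contained interactive chart of the kind a dashboard panel would embed; the dataset and columns are placeholders.

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("district_aggregates.csv")  # placeholder; columns are hypothetical

# One interactive chart of the kind a dashboard panel would embed.
fig = px.bar(
    df.sort_values("stunting_pct", ascending=False).head(15),
    x="district_name",
    y="stunting_pct",
    title="Districts with the highest stunting rates (illustrative)",
)
fig.write_html("dashboard_panel.html")  # self-contained, shareable HTML
```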
The data analytics steps outlined above form an iterative process that continues until the results are accepted and approved by clients and stakeholders.
For example, results were communicated through detailed reports that included major findings and recommendations.
(Dr. Annapoorna Ravichander is Professor of Practice, Department of Public Policy, Manipal Academy of Higher Education, Bengaluru Campus, and a freelance consultant)
(Mrinalini Kabbur is a freelance consultant in Data Analytics)