Using visualizations to explore data and pose questions.
Overview of Dataset
The World Indicators Dataset is originally from The World Bank and includes different indicators of a country’s development. This is a large dataset, with data collected across 207 countries, grouped into six regions, from 2000 to 2012 (Tableau). There are four groups of measures: five indicators related to business, seven related to development, five related to health, and six related to the country’s population. The data is temporal and includes nominal, interval, and ratio measures (Shneiderman, 2003). I’m specifically interested in health and thus focused most of my exploration on the health-related measures to develop my questions and hypotheses.
The quality of this data is high, considering it covers 207 countries and over a decade of observations (Tableau). However, one of the first limitations I noticed is that the data is aggregated only at the country and region level. As Hans Rosling explained in his TED talk, it is important to go beyond compiled averages when analyzing data across a country. Averages may not convey the whole story because there may be systematic differences within a country (Rosling, 2006). Still, the sheer amount of multivariate data in this dataset allows users to explore many different relationships and correlations. Another limitation is the presence of null values: users need to pay attention to which countries lack data for certain indicators and how that gap may affect the data story being conveyed about those countries.
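One way to act on the null-value caution is to count the missing values per indicator before reading anything into a chart. The sketch below is a minimal illustration in Python, not part of the Tableau workflow described here; the rows, field names, and values are hypothetical placeholders rather than records from the dataset.

```python
# Hypothetical rows standing in for the dataset; None marks a missing value.
rows = [
    {"country": "A", "infant_mortality": 0.05, "health_exp_per_capita": 120.0},
    {"country": "B", "infant_mortality": None, "health_exp_per_capita": 900.0},
    {"country": "C", "infant_mortality": 0.01, "health_exp_per_capita": None},
    {"country": "D", "infant_mortality": None, "health_exp_per_capita": 45.0},
]

def count_missing(records, fields):
    """Return {field: number of records where the field is None or absent}."""
    return {f: sum(1 for r in records if r.get(f) is None) for f in fields}

def countries_missing(records, field):
    """List the countries that have no value for the given field."""
    return [r["country"] for r in records if r.get(field) is None]

missing = count_missing(rows, ["infant_mortality", "health_exp_per_capita"])
print(missing)
print(countries_missing(rows, "infant_mortality"))
```

Listing the affected countries by name, rather than only counting them, makes it easier to see whether the gaps cluster in one region and could bias a regional comparison.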
Data Exploration and Visualization
I started my exploration of the data with a subset of measures of interest for the year 2012: birth rate, health expenditure per capita, health expenditure as a percent of gross domestic product (GDP), population urban, and overall GDP. I was interested in exploring any correlations or relationships between these measures and infant mortality. Infant mortality is a unique indicator of development because it can be indicative of a country’s community health status, socioeconomic status, quality of healthcare services, and the accessibility and availability of those services (Why Focus on Infant Mortality).
First Question of Interest:
How is the health expenditure per capita of a country related to its infant mortality rate? I hypothesized that as health expenditure per capita increases, the infant mortality rate will decrease.
Refining the Question
Given the size of this dataset, I realized I needed to be specific in developing my question and hypothesis. Initially, I was interested in checking whether there is a correlation between health expenditure per capita and health expenditure as a percent of GDP. I created a plot with % GDP on the x-axis and expenditure per capita on the y-axis, with each country encoded by the position of a circular point and its infant mortality rate encoded in the size of that point. However, I quickly found that countries differed widely in their per capita versus percent GDP expenditures. This led to numerous observations and, in turn, more questions. What does it mean when the United States has both a relatively high per capita and % GDP expenditure? Alternatively, what does it mean when countries like Liberia and Sierra Leone have high expenditure as a percent of GDP, but low expenditure per capita? Additionally, Equatorial Guinea seems to have the highest expenditure per capita of all the countries in Africa; however, other countries such as Libya and Seychelles, with similar expenditure as a percent of GDP but lower expenditure per capita, have lower infant mortality. After more exploration, I realized I would need to read more about health-related metrics of development to understand the context and explain the insights gained from a visualization comparing health expenditure per capita with percent GDP (Why Focus on Infant Mortality). Thus, I decided to focus solely on health expenditure per capita instead of % GDP.
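Part of the gap between the two measures follows from how they are related: health expenditure per capita equals the share of GDP spent on health multiplied by GDP per capita. A small economy can therefore devote a large share of its GDP to health and still spend little per person. The sketch below works through that arithmetic in Python; the function name, the two example countries, and all figures are made up for illustration and are not taken from the dataset.

```python
def health_exp_per_capita(gdp, population, health_share_of_gdp):
    """Per-person health spending implied by GDP, population, and the
    fraction of GDP spent on health (e.g. 0.15 for 15%)."""
    return health_share_of_gdp * gdp / population

# Hypothetical low-income country: large share, small GDP per person.
low = health_exp_per_capita(gdp=2e9, population=4e6, health_share_of_gdp=0.15)

# Hypothetical high-income country: smaller share, large GDP per person.
high = health_exp_per_capita(gdp=5e11, population=1e7, health_share_of_gdp=0.08)

print(round(low, 2), round(high, 2))
```

In this made-up case the first country spends nearly twice the share of GDP on health yet far less per person, which is the same pattern observed for Liberia and Sierra Leone.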
I first tried different ways of visualizing the data and started with a map, using the longitude and latitude of each country, to understand how expenditure per capita relates to infant mortality rates. After generating a choropleth map (Scott, 2020b) with infant mortality encoded in color and health expenditure per capita encoded in the size of a circular point on each country, I realized it was difficult to compare exact infant mortality rates between countries; however, it was still evident that a subset of countries in Africa and Asia had relatively higher rates. Furthermore, countries in Europe, North America, and Australia seemed to have generally higher health expenditures per capita. Upon reflection, I realized I preferred the overviews generated by my earlier visualizations, from the initial exploration phase, which used circular points and differences in position.
Finalizing the Visualization:
This visualization answers the proposed question by first showing an overview of all the countries. From the overview, I concluded that the relationship between health expenditure per capita and infant mortality rates looks like the graph of 1/x for positive values of x. There is a negative correlation: as health expenditure per capita on the x-axis increases, the corresponding infant mortality decreases, although the relationship does not look linear. The overview is also helpful in identifying the countries that pop out through pre-attentive perception. The pop-out effect is strongest for countries with either high mortality rates, like Sierra Leone or Angola, or high health expenditures per capita, like the United States, and for countries that do not follow the 1/x shape of the majority, like Equatorial Guinea (Scott, 2020a). Furthermore, having the option to filter and focus on specific countries of interest, such as those noticed through pre-attentive perception, helps visualize the different relationships and trends between health expenditure per capita and infant mortality (Scott, 2020a).
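A quick way to check whether a scatter really follows a 1/x shape is to look at it on log scales: if y = a/x, then log y = log a - log x, which is a straight line with slope -1. The sketch below estimates that slope by ordinary least squares in Python. It is an illustration under stated assumptions, not the analysis performed in Tableau: the function name is hypothetical and the points are synthetic, generated from y = 5/x rather than read from the dataset.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); inputs must be positive."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var

# Synthetic points lying exactly on y = 5 / x (for illustration only).
spend = [50, 100, 500, 1000, 5000]
mortality = [5 / x for x in spend]

slope = loglog_slope(spend, mortality)
print(round(slope, 3))
```

On real data the slope would not be exactly -1; a value far from -1 would suggest a different power law, and countries sitting well away from the fitted line would be candidates for the outliers noticed in the overview.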
Second Question of Interest:
How does the health expenditure per capita of a country change with GDP over time for Angola, Israel, Japan, New Zealand, Switzerland, and the United States? I hypothesized that as GDP peaked and increased, the country’s health expenditure per capita would also increase.
Iterating Design of Visualization
My initial question started out more general. Since my first and second questions are related, I had a better understanding of the question I wanted to ask after the first. But as I went through different iterations of my visualization, I decided to focus my question on a handful of countries of interest.
Thus, the iterations were more focused on the making of the visualization. I wanted to take advantage of the multivariate nature of this dataset and encode a couple of variables to provide context around my question of interest. From reading about infant mortality, it was evident that it is a complex indicator with many influencing factors (Why Focus on Infant Mortality). Although the priority and focus of the visualization design would be health expenditure per capita and GDP, I also wanted to explore including other potential measures of interest: infant mortality, birth rate, population urban, and population 0–14.
For GDP and health expenditure per capita, I started designing plots with a similar format to the first visualization. However, using circular points and their changes in position cluttered the overview of the visualization and made it difficult to identify patterns for individual points. For the scope of the question, I did not want to create a visualization similar to a parallel coordinates plot, where the focus is not on the individual points (Few, 2019).
Thus, I decided to redesign the visualization with a focus on encoding variables with either color or area. Stephen Few, in his discussion of the limits of multivariate data visualization, highlighted color intensity and area as the most useful for encoding quantitative information. Few also specified that minimizing clutter, especially in the graph’s overview, leads to cleaner and simpler displays that are ultimately easier for users (Few, 2019). I started with visualizations organized in a trellis display with a common x-axis of years from 2000 to 2012 (Few, 2019). Each country would be a single column with multiple rows of simple visualizations.
I continued to add more rows for health expenditure % GDP, birth rate, and population 0–14. My next refinement was deciding whether or not to add bars, with length encoding the corresponding quantitative variable on the y-axis. With six plots in each row, one per country column, the overview did not have any pop-out effect or foster pre-attentive perception. The plots and shapes were merging together. So, I added bars in the last three rows as a balance between continuing the theme of simplicity and providing a pop-out difference between the six plots.
Finalizing the Visualization:
This visualization answers the proposed question by first showing an overview of the six countries of interest. From the overview, there is a pop-out effect of the different shapes present in each row. However, the wide range of GDP values across the group made it difficult to see the shape of the GDP trend for certain countries. By utilizing the zoom and filter feature, I was able to gain a better understanding of the countries with GDPs much smaller than that of the United States.
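An alternative to zooming and filtering, not used in the visualization described here, is to index each country’s series to its first year so that every line starts at 100 and the shapes become directly comparable regardless of magnitude. The Python sketch below illustrates the idea; the function name and the GDP figures are hypothetical and not taken from the dataset.

```python
def index_to_base(series, base=100.0):
    """Rescale a time series so that its first value equals `base`."""
    first = series[0]
    if first == 0:
        raise ValueError("first value must be non-zero to index the series")
    return [base * v / first for v in series]

# Hypothetical GDP series in US dollars for a large and a small economy.
gdp = {
    "Large economy": [10.0e12, 12.0e12, 15.0e12],
    "Small economy": [20.0e9, 30.0e9, 50.0e9],
}

indexed = {name: index_to_base(values) for name, values in gdp.items()}
for name, values in indexed.items():
    print(name, [round(v, 1) for v in values])
```

The trade-off is that indexing hides the absolute size of each economy, so it complements rather than replaces a view of the raw values.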
All countries had generally similar shapes for their health expenditure per capita and GDP: as GDP increased, health expenditure per capita increased as well. The multivariate nature of this visualization also led to other insights. Interestingly, Switzerland’s health expenditure as a percent of GDP stayed fairly constant through the years. On the other hand, Japan increased its health expenditure from 7.6% to 10.1% of GDP (0.076 to 0.101 as a fraction).
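The observation that two series have "similar shapes" can be made more concrete with a Pearson correlation coefficient computed per country across the years. The sketch below shows the calculation in Python with a hand-written helper; the yearly values are hypothetical and stand in for one country’s GDP and health expenditure per capita.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical yearly values for one country (not taken from the dataset).
gdp = [1.0, 1.1, 1.3, 1.2, 1.5]              # trillions of US dollars
health_exp = [2000, 2150, 2500, 2450, 2900]  # US dollars per person

r = pearson(gdp, health_exp)
print(round(r, 3))
```

A value near 1 is consistent with the hypothesis, but two series that both trend upward over time will correlate strongly even without a direct link, so the coefficient supports the visual impression rather than proving a causal relationship.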
Given the scope of this assignment, these visualizations were designed to address a specific question and test a hypothesis. Future iterations of these visualizations, for more general use, may include other features recommended by Jeffrey Heer and Ben Shneiderman in “Interactive Dynamics for Visual Analysis.” The visualization would need to support deriving a subset of data from the source data, along with more extensive features for manipulating the view of the data. A potential feature is a workspace area, separate from the visualization, where users can manipulate their desired subset of data. Furthermore, more resources could be provided to support the annotation of patterns and collaboration between teams (Heer & Shneiderman, 2012).
- Few, S. (2019). The Perceptual and Cognitive Limits of Multivariate Data Visualization (pp. 1–15). Perceptual Edge.
- Heer, J., & Shneiderman, B. (2012, February 20). Interactive Dynamics for Visual Analysis. Acmqueue, 10(2), 1–26.
- Rosling, H. (2006, February). The best stats you’ve ever seen. https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen/transcript
- Scott, T. (2020a, April 28). COGS 128: Information Visualization Lecture 5.
- Scott, T. (2020b, May 12). COGS 128: Information Visualization Lecture 7.
- Shneiderman, B. (2003). The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In The Craft of Information Visualization (pp. 364–371). https://doi.org/10.1016/B978-155860915-0/50046-9
- Tableau. (n.d.). World Indicators — Tableau Extract [Preloaded Tableau dataset]. Retrieved from http://data.worldbank.org/indicator/all
- Why Focus on Infant Mortality (Infant Mortality Toolkit). (n.d.). Association of Maternal and Child Health Programs. http://www.amchp.org/programsandtopics/data-assessment/InfantMortalityToolkit/Documents/Why%20Focus%20on%20IM.pdf