White Paper

American Behemoth: Our Trillion-Dollar Healthcare System

When I first conceived of this project, it grew out of a conversation with a close friend of mine about how her own friend visited nine doctors, and only got a diagnosis when he visited the tenth, and still had to pay for his care on top of the doctors' fees. Coming off of that conversation, I thought the real story was about medical debt, but when I went to find data, I instead ran across how much healthcare costs on the whole. The data on healthcare costs was deeply compelling; I found myself looking at the tables, asking, "It's going to go up by how much?" and realizing it wasn't going to stop. One of my primary goals in creating this project was to make concrete the enormity of how much we spend on healthcare, and the story unfolded from the premise that I would inherently fall short of that goal: no one can really grasp how much trillions of dollars is, and I eventually realized that that is precisely the point.

American Behemoth looks at healthcare at the national health expenditure level, and then breaks it down into: 1) insurance/out-of-pocket spending; 2) percentage shifts in that spending; 3) categorical healthcare spending; and 4) a combination of the two: categorical spending by source of funding (insurance/out-of-pocket). The research questions I ended up formulating were: What are Americans more likely to be paying out of pocket for? How much of insurance is private versus public versus paid for by "third party providers"? Are any healthcare costs projected to fall rather than rise? The short answers ended up being: Other Non-Durable Medical Products; a lot versus a moderate amount versus a tiny amount; and not in the slightest. At the final review, more than a few people told me outright: "I'm sorry, I just can't get a handle on how much this is." "You said trillion?" "It does just keep going up." "This is depressing."

The data I worked with comes from the Centers for Medicare & Medicaid Services and includes an amalgam of data from a few governmental offices, including the Office of the Actuary. Out of 17 tables, I consolidated 15 into four separate datasets that I linked to my Tableau workbook for cleanliness. Those four datasets contained very similar information, just broken down in complicated ways. One of the challenges I faced with the data was keeping it clean and sanity checking myself frequently; a couple of times, I found that I had pivoted tables and immediately ruined the data. The data remained a challenge even after the first iteration, when I realized that if I wanted to create an effective visualization, I would need to separate and pivot certain parts of it. The fact that I had consolidated those 15 tables into four different Excel spreadsheets really hammered home how careful I would need to be with the numbers. In the end, creating four consolidated datasets felt like perhaps a bit much; however, as I built my visualizations, it became clear that each dataset meaningfully contributed to the point of each chart.
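Outside of Tableau, the kind of sanity check described above can be sketched in a few lines of Python. The categories, amounts, and total here are hypothetical stand-ins, not values from the CMS tables:

```python
# Hypothetical slice of a consolidated spending table:
# (category, year, amount in billions of dollars).
rows = [
    ("Hospital Care",      2019, 1192.0),
    ("Physician Services", 2019,  772.1),
    ("Prescription Drugs", 2019,  369.7),
]

# Total reported by the source table for the same year (also hypothetical).
reported_total_2019 = 2333.8

# After any pivot or reshape, the category amounts should still sum to the
# reported total; a mismatch means rows were silently dropped or duplicated.
pivoted_sum = sum(amount for _, year, amount in rows if year == 2019)
assert abs(pivoted_sum - reported_total_2019) < 0.1, "pivot broke the data"
```

The same check, repeated after each reshape, is a quick way to catch a ruined pivot before it propagates into a visualization.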

In total, I made 12 visualizations, consolidated into five dashboards, which were all fitted into one story sheet. Six of those visualizations were the categorical breakdowns, which I turned into a selectable view for the most detailed breakdown in the entire project. Compared to the prior iteration, which only included four visualizations in two dashboards, the final iteration was a lot more complex and accomplished my goal of telling a story with the data. My visualizations included: 1) a scatter plot with a money bag shape where a circle would be; 2) an area graph; 3) a heatmap; 4) two sets of area graphs, split by category and meant to work in tandem; and 5) several treemaps that accomplished the goal of first showing the categorical breakdown by percentage, and then showing that categorical breakdown by insurance or non-insurance type. For example, if you want to see what percentage of our overall spending in 2019 would go to hospital care, you would look at the first treemap, and then if you wanted to know what percentage of that comes from people’s own funds, you would go to the next section and select the year 2019 in out-of-pocket costs.

The area graph covered a specific portion of the graph per category and showed a breakdown of each insurance type, where the amounts per year added up to the total expenditures. I color coded each category to match the heatmap on the second page of the story. I felt that the area graph prepared the viewer to understand the heatmap, which is then broken out into percentages to show the minute movements. The heatmap itself is essentially a fancy table with colors that get darker as the corresponding number gets larger. In this case, it was fairly simple: over time, every category shifted down the color scale, because every single category just kept increasing. The two area charts working in tandem are slightly clunkier to explain; one of them is the same graph, just made to show smaller changes when the first graph is interacted with in a dashboard.

Essentially, the reason I chose another area chart instead of, for example, a line chart is that I felt a line chart would not convey the same sense of sheer volume. The point of the project is to show that healthcare costs are rising astronomically, and a line chart would have felt emptier, in a sense, than the area chart. For the same reason, I chose treemaps to represent the categorical parts of a yearly whole; a treemap uses size and color to show the portions of a selected amount, and in this case, my decision to turn each insurance type/overall total into a percentage made the categories actually comparable. A lot of my visualizations ended up being either parts of a whole or on some kind of Cartesian plane, which really speaks to the topic at hand. The struggle I kept coming back to was: how do I represent such a huge system in such a small amount of space?
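The percentage conversion behind the treemaps is simple arithmetic; a minimal sketch, using made-up category figures rather than the actual CMS numbers:

```python
# Hypothetical spending by category for one year, in billions of dollars.
spending = {
    "Hospital Care": 1192.0,
    "Physician Services": 772.1,
    "Prescription Drugs": 369.7,
}

total = sum(spending.values())

# Each treemap tile's size is that category's share of the yearly whole.
shares = {cat: round(100 * amt / total, 1) for cat, amt in spending.items()}

# The shares should always sum to (roughly) 100%.
assert abs(sum(shares.values()) - 100) < 0.5
```

Expressing every insurance type and overall total as a share of its own year's whole is what makes two years, or two funding sources, directly comparable on the same scale.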

Representing the fact that the data was made up of historical estimates and projections was difficult, and I constantly felt clumsy pointing this out on my graphs—however, one of my favorite things that I did to show the projections was creating a blank annotation, stretching the square out to cover the part of the graph that was projected, and then reducing the opacity to 15% to give the effect of slightly faded colors. It’s one of the few things that remained the same from start to finish, and one of the few things that was also very finicky when I was first starting out (it would disappear when I was trying to figure out what scale I wanted to use for the scatter shape chart).

A lot of the time, I made sure to follow Tufte's principles, removing anything unnecessary for the graph to be understood. The feedback I received at the pin-up was invaluable for telling me what that was: certain legends were necessary, and others weren't. Definitions were required for understanding the healthcare system itself, as were instructions on how to engage with the graphs. Though they took up a lot of space, they were absolutely essential; without them, people would have been asking questions about what the data actually meant, questions that could not be answered with the data alone. So I decided to make those definitions as unassuming as possible.

At first, I wanted the first set of definitions to pop up when someone hovered over a certain portion of the graphs, but I couldn't find any way to create an annotation that appears on hover. Finally, I decided that if I simply created a balanced view on the first story pane, it would still look aesthetically pleasing and I would still accomplish my goal of providing definitions. The second set of definitions was much easier, as I built them right into the area graphs, but at the review, I ran into an issue without an immediate solution: sometimes, people completely ignored those annotations, moved on to the treemaps, and then asked me what each categorical expenditure meant. This only happened twice, when I had larger crowds, so I have a feeling it has something to do with being able to really sit with the story. The instructions were fairly straightforward; I included them only for things that I felt weren't immediately obvious, like my final interactive piece. Because I couldn't include titles (the dropdown view toggle wouldn't work otherwise), I had to point the viewer to the dropdown menu, indicate what the default view was, and then point out that there was a year filter.

One of the things I most wanted to do, from the very beginning, was tell a color story using two complementary colors: green to represent money/greed, and red to represent health/bodies. A lot of the time, green can signal something good, while red can signal something bad; however, by immediately associating green with money bags, I asked the viewer to see it as something financial rather than any of its other associations, and from what I heard at the pin-up and review, it seems to have worked. The only thing that mildly concerned me about this color story was the fact that, per Nathan Yau's Data Points, red is not exactly accessible to those with color-blindness. However, that issue mainly comes into play when a visualization asks the viewer to differentiate red from other colors; the saturation in the heatmaps was varied enough that the main function (showing less and more) remained, so I kept it. Additionally, the urgency inherent in the color red convinced me that it was a necessary component of the visualizations' story.

With regard to font, I settled on a mix of sans serif for the story title, serif for the dashboard title, and sans serif again for the individual graph titles and annotations. My intent was simply to create an aesthetic contrast, something eye-catching and interesting so that my viewers wouldn't get bored. One difficulty I ran into was sizing everything for the final review as opposed to publishing on Tableau Public; though the former allowed me more space to stretch everything out, I also had to remember that people needed to be able to read everything, especially from a distance. For that reason, a lot of the instructions had large font sizes, while the descriptions were slightly smaller. When publishing to Tableau Public, I found that my definitions and instructions were getting cut off, so I reduced font sizes, made some of the graphs smaller, and set everything to a fixed size so that I could control what the view looked like. For tooltips and ancillary annotations, I varied font size and color in shades of gray/black to emphasize the information I thought was most important, usually years, categories, and insurance types. I wanted ancillary annotations to be legible but unobtrusive, so for things like sources, images, and the citation of the methodology paper, I found smaller corners and font sizes where they wouldn't offend the eye but would still be accessible.

The biggest challenge I faced in editing the visualizations was getting a treemap to become small multiples. One point of feedback I received at the pin-up was that the out-of-pocket treemap I had made was too difficult to understand, because it combined all of the years. In my first iteration of that visualization, you could not compare the categorical breakdown side by side year over year; to compare years, you had to combine them, which made things more confusing in the long run. I did a lot of research in attempting to address the critique, and as far as I can tell, there is no real way to create "small multiples" of a treemap in Tableau, at least not in the traditional sense. I tried a few different things before I finally found a tutorial explaining a "toggle" technique, which accomplished my goal of swapping between the different worksheets I had created. Though I had to give up having titles above each of the treemaps, the fact that my "small multiples" ended up working at all was possibly the highlight of that iteration.

Finally, I took some time to think about the title of each dashboard as well as the overall story title. During the pin-up, one of my classmates suggested that I swap my first dashboard's title with the story title. This is where I really thought about what I was saying with this project, because I was wrapping the entire story up with a bow. I knew I was trying to show just how high healthcare costs are and how much higher they're going to get, but until I had to rename my story, I wasn't sure how to present it. I settled on the idea of a monster, something so vast and archaic that it could at least come close to describing the United States' healthcare system. Moving from "my friend's friend ended up paying way too much for healthcare" toward "this system is genuinely going to continue to grow larger and larger until we do something," I purposely made all of my titles reflect this, and the result was a more cohesive ending than the previous iteration had.

This project pushed me beyond my comfort zone in working with Tableau, and participating in the pin-ups especially encouraged me to seek out more advanced ways of creating visualizations with the software. The hands-on approach of figuring out how to adapt tutorials to my own visualization goals allowed me to cement these skills in a way that working with theory alone never would have. Content-wise, my classmates, TA, and professor really motivated me to think outside the box and go beyond aesthetics to tell a story that might make a difference. In the end, because of all the challenges I faced in creating this project, I'm far more confident using Tableau to tell stories with data, and in the future, I hope to continue creating data stories that make information accessible to a wider audience.


This project topic would not have been what it is without a conversation I had with one of my closest friends about the heaviness of how profit corrupts care and how the high cost of medical care could hit any one of us at any time—thank you, Allie!

Additionally, huge thanks to my pin-up group, our class’s TA Andi Cupallari, and Professor Michelle McSweeney, who all really pinpointed what could be made better and helped me find the sculpture in the marble of my pin-up iteration. I’m really proud of this project, and it wouldn’t have come together as cohesively as it did without the insight and help everyone offered. Thank you all!

Final project reference:


The United State of American Healthcare Costs

When we hear about Medicare for All, proponents of the policy are talking about how we pay to take care of our health. One of the reasons Medicare for All is being put forth is that healthcare in the U.S. is so expensive that cost is one of the major reasons people regularly report avoiding the doctor's office. It seems like the cost of healthcare goes up every year, and I'm here to tell you: you're not imagining it. Every single year, we are paying more and more to take care of ourselves, and it's projected to get even worse over the next ten years if we don't do something.

While putting this project together, I asked:

What are Americans more likely to be paying out of pocket for? How much of insurance is private versus public versus paid for by "third party providers"? Are any healthcare costs projected to fall rather than rise?

My audience for these visualizations consists of Americans who are concerned about the cost of healthcare. As healthcare costs rise, more people will likely have to pay attention to their expenditures, and I hope that being able to see these projections could help them prepare for an uncertain future in managing their healthcare costs, or even put them in a position to lobby Congress for regulation of the health and/or pharmaceutical industries in general.

I made a total of five visualizations for this project, which collectively tell a story about where our healthcare costs are going. Because of the data, all of the visualizations contain historical estimates from 2011-2017 and projections from 2018-2027. The first visualization is a simple shape chart showing an estimate of how much we spent on healthcare each year from 2011-2017, and then a projection of what 2018-2027 will look like. We go from spending around $2 trillion in 2011 to $5 trillion in 2027, and these numbers have been deflated (adjusted for inflation). The second and third visualizations are paired: they both show a breakdown of where all that money comes from, but the area chart shows it without numbers while the heat map table shows it with them. The area chart also shows which paying options (insurance/non-insurance) will pay less and which will pay more, which is harder to see in the table. The heat map table, however, does the job of asking you to confront the number of zeroes in each category, which the area chart was not built for.

In my second set of visualizations, I broke down what the money is being spent on, first in a treemap showing out-of-pocket payments, and then in an infographic showing what we spend on overall. The treemap can be filtered by year, and uses percentages rather than raw numbers so that the viewer's mind can more readily grasp the parts of the whole; the tooltips then show the actual amounts in billions of dollars. The infographic uses the "medical ID" symbol and represents the amount spent/projected to be spent by size. This is particularly useful when sorting the fields, showing overall spending and providing a useful measure against the out-of-pocket visualization: Americans don't usually pay hospital costs out of pocket, yet hospital care accounts for the most spending. Physician care is second in both visualizations, showing that our doctor's visits are largely paid out of our own funds.

The data itself was fairly clean, and I was able to consolidate multiple datasets into two main ones: one consolidated different sectors of healthcare (physician services, hospital care, etc.) divided by type of medical payment (insurance types versus out of pocket), and the other consolidated demographic data, population data, and breakdowns of federal and state government expenditures. I mostly used the first dataset, because the categorical breakdown was narrower in scope and more interesting than the federal/state breakdown.

Regarding design choices, I wanted to play with custom shapes and palettes for this project, so my first visualization uses money bag icons, and my second uses a custom palette. I tried to separate color choices by dashboard/story beat: green palette for the money, red palette for categorical medical breakdown. The two colors are contrasting as well, establishing a visual dichotomy. A lot of my data represents parts of a whole, which really tempted me to make a pie chart, and I almost caved, but in the end, it just felt too reductive.

In terms of limitations, I was attempting to make an infographic visualization regarding what one person on average pays for medical care in the US, but I couldn’t get the calculated field right. Additionally, I was looking into making a waffle chart, but what I had in mind was too complicated for Tableau.

For next steps, I would like to create a story section about the individual breakdown. It's a little dangerous, but I think it would punctuate the prior two sections by showing how much trillions of dollars breaks down to per person, and then making the point that since some people carry no medical debt at all, the burden on those who do is actually higher than the average. Considering the 2011 data puts healthcare spending per capita at around $8,000, I really feel this section is necessary.
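The per-person arithmetic this section gestures at is straightforward; a sketch using rough ballpark figures, not values pulled from the project's datasets:

```python
# Approximate figures for illustration only; both are rounded assumptions,
# not numbers from the CMS tables used in this project.
total_spending_2011 = 2.7e12   # ~$2.7 trillion in national health expenditure
us_population_2011 = 311.6e6   # ~311.6 million people

# Dividing the national total by the population gives the per-capita figure,
# which lands in the same ballpark as the ~$8,000 cited above.
per_capita = total_spending_2011 / us_population_2011
```

Because this is an average over everyone, including people who spend nothing, the figure understates what the heaviest users of the system actually pay.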

Patterns of Documentation: Visualizing my Goodreads Page

I’m a bit of an avid reader, or so I thought before I undertook this project—when I pulled my Goodreads data, I was shocked to see that there were significantly fewer books than I remembered reading. I’ve been doing reading challenges for a few years now, and they all ran along the lines of 50 books each year, so why did I only have about 185 records? When I dug into the data, it became clear: I hadn’t actually used Goodreads to document my reading challenges prior to 2018. 

This turn of events led me to my research question:

What did my usage of Goodreads look like before and after I started using it to document every single book I’ve been reading, and what publishers and authors did I most document?

My audience for these visualizations is pretty much limited to my friends and family. My friends in particular are in the literary community and would have a lot of context for the data in question, since I talked a lot about these books while reading them; they would also probably be able to tell that I hadn't documented the data completely.

Below is a dashboard consisting of three separate visualizations of data I documented on Goodreads: 1) a bar chart of the books I documented, split up by month and year; 2) a bubble chart of the publishing imprints I documented most; and 3) a tree map showing the authors with the most books I documented reading. All three can be filtered by year, and the colors in the bubble chart represent the average number of pages.

Overall, I wanted a layout that was organized, neat, and as legible as I could get it. I chose the primary color scheme as a kind of homage to the yellowing of book pages as they grow older. The charts below are the visualizations from the dashboard, broken down individually.

The bar chart is the most interesting visualization overall in terms of seeing usage: 2018 and 2019 are the longest sections in it, with significantly higher bars, because that's when I began using the built-in reading challenge on Goodreads, which let me note down when I finished a book and counted it toward a real, visible goal. I chose a bar chart rather than a line chart or a scatter plot because the bar chart really illustrates how the data fills out around 2018 and 2019. I also chose to put the number of books above each bar to make it easier on the eyes.

The moment I thought of representing publishers, I felt the call of a bubble chart. It's whimsical in the way publishing can feel when you engage with imprints' social media, and it really represents the idea of big imprints versus smaller ones. When I first created it, it was all one color, so I added a variable for the average number of pages per publisher, which produced some surprising results, like highlighting a much smaller publisher (Gollancz). The filter can do a lot of work here; for example, Tor sits at the same size level as Orbit, an imprint of Little, Brown (one of the Big Five), but if you add 2013, Tor becomes much, much bigger.

The tree map ended up being one of my favorites, because like in the bubble chart, the filter significantly changes what you see—in 2014, I documented a lot of what I read by Brandon Sanderson, but he drops almost completely off the map if you exclude 2014, and the forefront of the map gets mostly taken up by female authors. I chose a tree map because I wanted to be able to see the authors’ names, while also seeing just how many books I’d been reading and documenting by those authors.

Most of the challenges I ran into with the data were tied to the cleaning, as I suspected when I began this. There was a lot of missing data, some of which I just couldn’t find or get a hold of. I tried to find the missing data, first contacting the NYPL to see if they had my checkout history, and then combing through my old social media posts to see if I could find mentions of the books I read (which was surprisingly exhausting, emotionally). However, doing that made me realize I was working with data that was not representative of my reading habits as a whole, but rather of my documentation habits of my reading for this one specific website. It’s one thing to read a book, and it’s a whole different thing to sit down, search for the book, mark it read, remember what dates I read it, and then star and review it, especially back when the mobile app wasn’t as usable. I had actually listed out the books in my Notes app instead from 2014-2016, and thought about pulling that data in, but in the end I found that there was a much more compelling story in the fact that I’ve started to actually use Goodreads to track my reading over the past two years.

One of the things I wanted to do with this project that I ended up scrapping for lack of time and sheer lack of data was a visualization of my actual reading habits, so I’d love to be able to dig deeper and compare my actual reading habits to what I was able to document on Goodreads. From what I can see so far, there is some crossover between the data, but the missing data amounts to around 40-50 records for 2014, 2015, and 2016 combined. My estimate for 2017’s missing data is around 30-40 records as well, so it would be really interesting to pull up an area chart of both sets of records to compare them. If I could also list out which imprints belong to which of the “Big Five” publishing companies, I think that would be a really compelling visualization.

Homeless Outreach in NYC: 311 Reports About New York’s Vulnerable Populations

Over the past few years, the De Blasio administration has taken a special interest in addressing New York City’s homeless population via outreach methods, namely in the Home-Stat project, which aims to encourage New Yorkers to report any homeless individuals they might see on the streets to 311, so that an outreach team can attempt to assist the individual. 

In working with the 311 data, I asked the following two-part question:

Over a five year period, what boroughs have had the biggest changes in homeless assistance reporting, and have more people been accepting assistance on a yearly basis, or has there been a relatively static trend?

The following visualizations aim to help New Yorkers understand if it is helpful to dispatch an outreach team to address homelessness in the city, as well as the boroughs in which we do the most reporting. Additionally, they will help the DHS understand what times of year they should recruit for their mobile outreach teams, and when they should do more preparatory work since the reporting is not as robust.

Figure 1, “311 Requests for Homeless Person Assistance by Borough, Sept. 2014 – Sept. 2019,” shows the monthly number of reports made to 311 by New Yorkers in each of the five boroughs. From 2014 to early 2016, there was a slow upward trend in reporting, which then skyrocketed in March 2016.

Though we haven’t seen a boom in reporting like that since, the amount of reporting is still much higher than it was pre-2016. Additionally, less reporting tends to happen in the winter months, from around December to March, while most reports come in the summer months, peaking around August.

Despite—or perhaps because of—Manhattan having the least amount of space, most reporting happens in Manhattan overall. The population density in Manhattan might be one explanation for these thousands of reports over the five year period, but other factors might contribute.

In Figure 2, “Yearly Reports on NYC’s Homeless Population by Resolution,” there are several resolution descriptions provided by the DHS for the 311 complaints. My research question focuses on those who accepted assistance, and using the heat map to show the trend, you can see a significant drop in assistance acceptance after 2017. In context, however, if you turn your attention to the row right beneath it, there was a significant increase in 2018 in resolutions that end right at the point of offering assistance.

Since each individual report gets only one resolution, that increase is unlikely to be double-counted within the “accepted assistance” or “did not accept assistance” rows.

Additionally, much like in Figure 1, you can see the spike in reporting in 2016—along with the significant drop in years afterward. Most of those reports from Figure 1 concluded that the individual could not be found.

The other graph, Figure 1, was much less finicky—I decided to go for a stacked bar chart with a monthly breakdown rather than a yearly breakdown because the trends are much more visible over the months, and you can see how they shift with the boroughs, which was an essential part of my research question. Additionally, you can very clearly see the summer months versus the winter months, which ended up being an interesting point I hadn’t considered in my research question.

Working with the data, it became apparent to me that the values I thought were most useful and fascinating were also the most messy. It was definitely a challenge to find the right way to visualize that data in particular. I started with a line graph for what ended up becoming Figure 2, but because the data was so messy, I concluded that no one would be able to read the graph, so I went for a more legible visualization. I was hesitant about the heat map because it is a little more difficult to show trends—colors are not the greatest way to do so—but because the resolution descriptions were inherently strings, everything got very cramped and cluttered if I chose any other way.

With more time to clean up the data, it would be great if I could aggregate the resolution descriptions that were input improperly. Additionally, after aggregating the resolution descriptions, I would make the graphs integrate a little better—breaking the heat map up by months instead of years. I also wanted to compare this data with the data on homeless encampments—some of these cases were referred to the NYPD, and I wonder how they interact with that data. Preferably, I would like to have one graph that accomplishes something big instead of two that accomplish two small, distantly related parts of the data.
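One way the aggregation of improperly input resolution descriptions could be sketched is with a hand-built mapping of known variants to canonical labels; all of the strings below are hypothetical, not actual DHS resolution text:

```python
# Hypothetical mapping from normalized variants to canonical resolution labels.
CANONICAL = {
    "the person accepted assistance": "Accepted assistance",
    "individual accepted assistance.": "Accepted assistance",
    "the person declined assistance": "Did not accept assistance",
}

def clean(description: str) -> str:
    """Collapse case and stray whitespace, then map to a canonical label."""
    key = " ".join(description.lower().split())
    return CANONICAL.get(key, "Other / unrecognized")

# Tally resolutions after cleaning, so near-duplicate spellings aggregate.
counts = {}
for raw in [
    "The person ACCEPTED assistance",
    "individual accepted assistance.",
    "The person declined  assistance",
]:
    label = clean(raw)
    counts[label] = counts.get(label, 0) + 1
```

Building the variant table would still require a manual pass over the distinct resolution strings in the 311 export, but once it exists, the heat map rows could be rebuilt on the aggregated labels.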