A Window into Data Transparency
Here is my own experiment in data transparency in data visualization. It’s from a paper of mine on a disorder called familial dysautonomia (FD, in the figure). The paper compared physiological variables from 25 children with the disorder to 25 age/gender/race-matched controls (CN) based on in-home EKG and respiration recordings. The figure is rather dense, so by way of explanation I’ll break it down. There were recordings done during the day (left) and night (right), for the two sets of subjects (upper and lower blocks).
The multicolored blocks represent the heart rates of all of the children in the study over the full time they were studied, with time along the abscissa. For the day studies, each child had two studies of two hours each, which are shown as two color bands within each block. There was only one night study for each child, so in that block each band is twice as tall; this keeps each subject’s two daytime bands adjacent to his/her nighttime band. In addition, in all four blocks, the youngest patients are arrayed at the top of the block, and the oldest at the bottom. Missing or artifact data epochs are shown in white. The colors themselves indicate the heart rate of that child coded as a percentile of his/her matched control’s heart rate (thus, the control children are self-normalized).
Leaving aside the clinical interpretation of the data in terms of the particular pathology, my goal with this figure was to be as forthcoming as possible in showing all the data, warts and all. This required some compromises. The normalization to control values was needed because the absolute heart rate values were too different to be represented on a single color scale. This means that absolute heart rate is not represented, and high or low rates outside the control distribution saturate to the ends of the color map. Also, the full temporal resolution of heart rate changes can’t be represented at this graphic resolution, so information about high-frequency heart rate variability is lost in this figure.
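For readers who want to try this kind of coding on their own recordings, here is a minimal Python sketch of the percentile normalization described above. It is an illustration under stated assumptions, not the code used to make the figure: the function name `percentile_of_control`, the use of NumPy, and the convention of marking missing or artifact epochs as NaN are all hypothetical choices.

```python
import numpy as np

def percentile_of_control(subject_hr, control_hr):
    """Code each subject heart-rate sample as a percentile (0-100) of the
    matched control's heart-rate distribution.

    Assumption: missing or artifact epochs are stored as NaN.
    """
    subject = np.atleast_1d(np.asarray(subject_hr, dtype=float))
    control = np.asarray(control_hr, dtype=float).ravel()
    control = np.sort(control[~np.isnan(control)])  # drop missing control samples
    if control.size == 0:
        raise ValueError("control recording has no valid samples")

    # Number of control samples at or below each subject sample.
    rank = np.searchsorted(control, subject, side="right")
    pct = 100.0 * rank / control.size

    # Keep missing epochs missing so they can be drawn in white.
    pct[np.isnan(subject)] = np.nan
    return pct

# Made-up numbers, purely for illustration.
control = [60, 70, 80, 90, 100]
subject = [50, 75, np.nan, 120]
coded = percentile_of_control(subject, control)
# 50 is below every control sample (0), 75 exceeds 2 of 5 (40),
# the NaN epoch stays NaN, and 120 is above every control sample (100).
```

Note how the values 50 and 120 land exactly on 0 and 100: anything outside the control distribution saturates at the ends of the scale, which is the loss of absolute heart rate discussed above. Passing a control child's own recording as both arguments gives the self-normalized version. The resulting array could then be drawn as one color band, with NaN epochs mapped to white.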
I haven’t seen this kind of presentation of data in many physiology papers, though similar figures are often seen in omics research. My goal was to show as much of the heart rate data in as much detail as possible, including covariates that were not specifically addressed in the results (like age). Aside from the general incentive for transparency, meant to let the reader assess my conclusions, I hoped that this format would allow readers to test their own hypotheses against a large dataset. Of course, this openness also exposes the data to a level of scrutiny that could be avoided with summary statistics. The readers can see exactly how much data had to be thrown out, for better or worse.
I’m curious to know what people think of this approach. Is the goal of transparency reasonable, or is this data-dense figure more of a distraction?