Extract data from images #171

We frequently encounter situations where data gatherers need to extract data directly from figures (for example, reading a percentage off a bar chart to obtain event counts). Since this cannot easily be done accurately by eyeballing a figure, we typically need other software to obtain this kind of data.

The best method that I know of is the ‘digitize’ package for R (https://cran.r-project.org/web/packages/digitize/digitize.pdf), but it can be cumbersome for data gatherers, and it would be very convenient if there were a built-in tool in NK. A good example of digitize in use is found here: https://lukemiller.org/index.php/2011/06/digitizing-data-from-old-plots-using-digitize/

What would make the digitize tool even better is if you could:

  1. Take a page and calibrate multiple figures at once without having to set up a new window
  2. Hover over a plot and see the x/y coordinates as you move across the image/page
  3. Automatically extract a coordinate by clicking on a point. Users could manually adjust values as needed
  4. Add some way of flagging data not directly given in papers. For reproducibility, this would be useful for data extracted by digitizing as well as data produced by other manipulation procedures (unit conversions, derivation from other summary statistics, etc.)

I'm unsure how feasible this is to set up for a PDF upload in the Extraction module, but it could be very valuable if this request is reasonable.

3 years ago

I’d only heard of these before, very cool! Can I ask how you’d use the extracted data in the course of a meta-analysis? NK currently only consumes descriptive statistics, so the best I imagine we can do is compute e.g. means & SDs from the extracted data sets. Maybe compute correlation coefficients, in a future where we use that data.

Maybe I’m being a complete square, lmk your ideas!

3 years ago

Here is an example of how this has been used in the past:

An author reports mortality rates for two different patient groups as percentages in a bar chart, but the exact values are not given in the figure. All we have are the bars and the sample sizes for both groups, and it is hard to tell the actual percentages from visual inspection. By calibrating the image using digitize (i.e., specifying the values and locations of the minimum and maximum x/y coordinates on a figure), you can click on the top of each bar and get a very close approximation of the percentages. From these approximations, you can typically infer the actual percentage, given that the sample size is known. From this, we would obtain event counts that are amenable to the Extraction module.
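
To make the arithmetic concrete, here is a rough Python sketch of the two steps: a linear calibration from pixel position to axis value, then snapping the approximate percentage to the nearest whole event count. This is not the digitize package's API. The function names, pixel positions, and sample size are all made up for illustration, and a linear (non-log) axis is assumed.

```python
def calibrate_axis(pixel_lo, pixel_hi, value_lo, value_hi):
    """Return a function mapping a pixel coordinate to an axis value.

    Assumes a linear axis; the two calibration points are the pixel
    positions of two known axis values (hypothetical helper).
    """
    scale = (value_hi - value_lo) / (pixel_hi - pixel_lo)

    def to_value(pixel):
        return value_lo + (pixel - pixel_lo) * scale

    return to_value


def infer_event_count(approx_percent, n):
    """Snap an approximate percentage to the nearest whole event count.

    Returns the event count and the exact percentage it implies.
    """
    events = round(approx_percent / 100 * n)
    return events, 100 * events / n


# Made-up calibration: 0% sits at pixel row 400, 50% at pixel row 100
# (pixel rows grow downward, so the scale comes out negative).
to_percent = calibrate_axis(400, 100, 0.0, 50.0)

# Made-up click on the top of a bar at pixel row 282, group size n = 60.
approx = to_percent(282)                      # roughly 19.67%
events, exact = infer_event_count(approx, 60)
print(f"approx {approx:.2f}% -> {events}/60 events ({exact:.1f}%)")
```

With a small sample size, only a few percentages are possible, which is why a close approximation is usually enough to pin down the count. With a large n, neighbouring counts differ by less than the click error, so the result should be flagged as approximate.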

Other cases might be finding values for medians/IQRs from a boxplot or obtaining a survival rate (and manually converting it to event counts) from a specific timepoint on a survival curve. Overall, digitize is a pretty flexible tool and could be useful for gathering data from many other types of figures, as long as it is an x/y type graph with numeric values on at least one of the axes.
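
For the survival curve case, the manual conversion could look something like the sketch below. It assumes no censoring before the timepoint, which often does not hold for real survival data, so treat it as a simplification. The function name and numbers are hypothetical.

```python
def events_from_survival(survival_percent, n_at_start):
    """Convert a digitized survival percentage to an event count.

    Simplifying assumption: no patients were censored before the
    timepoint, so events = n * (1 - survival proportion).
    """
    if not 0 <= survival_percent <= 100:
        raise ValueError("survival_percent must be between 0 and 100")
    return round(n_at_start * (1 - survival_percent / 100))


# Made-up reading: 72.4% survival at the chosen timepoint, 50 patients at start.
print(events_from_survival(72.4, 50))
```

If the paper reports numbers at risk or censoring counts, those should take priority over this shortcut.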

Thanks for considering!

3 years ago

Got it! This would be a great addition to table data extraction that we’re hoping to build in our NLM phase II. To be transparent, this probably isn’t something we’ll be focused on in the next couple of months due to bigger fish to fry (RoB and dual extraction taking priority), although I think this is a great feature.

3 years ago

No problem, I definitely understand that there is a lot that needs to come first!

3 years ago
Changed the status to Under Consideration
3 years ago