Recovering numeric data from an image of a graph

Sometimes all you have is an image file of a data graph (perhaps from a scanned publication) and you would like to convert the data points to xy coordinates (maybe you want to try an analysis on someone else’s data). R gives some tools to help. You need to save the image file as a PNG file; if you don’t have software to convert or save images in PNG format, many programs will do it, including Photoshop, free editors such as GIMP, Krita and Photopea (an online editor that is very Photoshop-like), and programs designed as viewers that also handle conversions with aplomb, such as XnView and IrfanView. A web search will get you to these programs. Here is the start of some code:

Note there is another variant of this code towards the bottom of the post that is probably better, depending on your needs.

##### VERSION 1 -----

# load library with readPNG() function
# (install this by un-commenting the next line if it isn't already on your system)
# install.packages("png")
library(png)

# make a new plot window
plot.new()

# read in an image file
img=readPNG("fig.png")

# paint the image onto the plot area.
rasterImage(img,0,0,1,1)

# use the locator() function to get the xy coordinates of points 
# that you click on the plot.  At this point R goes into interactive mode. 
# in RStudio the top of the plot window will show "Locator active (Esc to finish)"
out=locator(100,"p")
# click on the points (here the maximum points to collect is set at 100, 
# and "p" says draw a circle on the plot where ever you click. 
# Each time you click the xy coordinates are added to the variable out.
# when you are done, press the Esc key.

# now show the values you collected. In this case I collected 3 clicks.
#
out
$x
[1] 0.1729519 0.1744344 0.1877773

$y
[1] 0.7589416 0.3948941 0.5566592

If you start by clicking on tick marks at each end of the y axis and the x axis, you will then have reference points that let you map the subsequent xy coordinates, which locator() reports as fractions of the plot area measured across and up from the bottom left, into whatever units the graph uses.

R also has an identify() function, which looks close to where you click for a plotted point and reports which point it was, so you don’t need to be so precise with your clicks. Note, though, that identify() can only snap to points whose x and y coordinates R already knows; it cannot find plot symbols inside an image. For a scanned graph with discrete points, locator() is still the tool to use, and you need to click carefully on each symbol.

In the example code above the output shows 3 x coordinates and 3 matching y coordinates. These represent the x-y coordinates of the points in the order you clicked them (so it pays to be systematic in the order you click them). My approach is to paste these numbers into Excel, where the x and y coordinates will each paste as a text string into a single cell. Use the Text to Columns tool (under the Data menu in current versions of Excel) and split the text into columns using the delimiter “space”. You should now have x and y coordinates of the points running across the columns from left to right. Note that these numbers are positions on the image as drawn in the plot area, with the bottom left point being (0,0) in this coordinate system and the top right (1,1).

You might prefer copy and paste-special-transpose to get the data into x and y columns.
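If you would rather skip the copy-and-paste step, the same x and y columns can be made inside R. This is a minimal sketch, assuming `out` holds the three clicks shown above; the names `pts` and `csv_file` and the file name `clicked_points.csv` are made up for illustration.

```r
# The three clicks shown above, typed in as locator() returned them
out <- list(x = c(0.1729519, 0.1744344, 0.1877773),
            y = c(0.7589416, 0.3948941, 0.5566592))

# locator() returns a list with elements x and y, so it converts
# directly to a data frame with one row per click
pts <- as.data.frame(out)
print(pts)

# Save as CSV so the points open in Excel already split into columns
# (hypothetical file name, written to R's temporary directory here)
csv_file <- file.path(tempdir(), "clicked_points.csv")
write.csv(pts, csv_file, row.names = FALSE)
```

Change `csv_file` to a path in your own working folder if you want to keep the file.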

Now, let’s assume that the first two coordinates represent 100 and 0 on the Y axis respectively. You can then use simple arithmetic to generate the corresponding y values for all the points below (the same principle applies for x values if you have reference coordinates for the x axis).

A simple calculation like this will give you the Y value calculated from the Y-axis data. The relevant formula is shown as text in cell D7. Note the use of $C$5 etc. so you can copy the formula down to convert the XY locations in more cells below (which would be the usual case). If you have an xy scatterplot and want to interpolate the x values to go with the y values, you can make the same sort of calculation once you have chosen relevant points on the x axis to serve as reference values.
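For anyone who prefers to stay in R, the same arithmetic can be sketched as below. It assumes, as in the text, that the first two clicks were on the Y-axis ticks for 100 and 0, and it uses the three y values from the example output. `rescale_axis()` is a made-up helper name, not a base R function.

```r
# Hypothetical helper: linear rescaling from clicked plot coordinates to
# graph units, given two reference clicks and the axis values they mark
rescale_axis <- function(v, ref_clicked, ref_value) {
  ref_value[1] + (v - ref_clicked[1]) *
    (ref_value[2] - ref_value[1]) / (ref_clicked[2] - ref_clicked[1])
}

# y values from the example output; clicks 1 and 2 are assumed to be
# the Y-axis ticks at 100 and 0, click 3 is a data point
out_y <- c(0.7589416, 0.3948941, 0.5566592)

y_units <- rescale_axis(out_y,
                        ref_clicked = out_y[1:2],
                        ref_value   = c(100, 0))
print(round(y_units, 2))
```

The two reference clicks come back as 100 and 0, and the third click lands at roughly 44.4 on the graph's Y scale. The same function works for the x axis if you pass it x reference clicks and their axis values. It assumes linear axes; a log axis would need the reference values log-transformed first.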

Note that the R code above is just a stub. It would be straightforward to do the calculations in R if you wanted to, and even pass the resulting interpolated values to relevant statistical or graphing routines within R. R has some very powerful and customisable graphics capabilities far beyond the primitive stuff in Excel. But I will leave that to the individual reader so they can customise the code to suit their needs. There are lots of guides and tutorials on making beautiful graphs in R.

Another update that simplifies the interpolation: a modified script that probably speeds things up is below, with explanation following.

##### VERSION 2 - automatic interpolation so long as you edit the appropriate bits of this code and trim the graph image to suit -----

# load library with readPNG() function
# (install this by un-commenting the next line if it isn't already on your system)
# install.packages("png")
library(png)

# make a new plot window
plot.new()

plot(x=c(0,18),y=c(0,4),xlim=c(0,18),ylim=c(0,4))

# read in an image file
img=readPNG("testgraph-crop.png")

# paint the image onto the plot area.
rasterImage(img,0,0,18,4,interpolate=TRUE)

# use the locator() function to get the xy coordinates of points 
# that you click on the plot.  At this point R goes into interactive mode. 
# in RStudio the top of the plot window will show "Locator active (Esc to finish)"
out=locator(100,"p")
# click on the points (here the maximum points to collect is set at 100, 
# and "p" says draw a circle on the plot where ever you click. 
# Each time you click the xy coordinates are added to the variable out.
# when you are done, press the Esc key.

# now show the values you collected. In this case I collected 3 clicks.
#
out

OK, these are the crude test-code modifications I used with the following graph.

Note the Y axis covers 0-4 and the X axis covers 0-18. Crop the graph so the image runs from just (x,y) = (0,0) at the bottom left to (18,4) at the top right, and use this cropped image in the script.

The revised code first plots a dummy graph with the same x and y range. The bits you will need to change to suit the x and y range of your own graph are the axis limits (0 to 18 and 0 to 4 here):

plot(x=c(0,18),y=c(0,4),xlim=c(0,18),ylim=c(0,4))

It then reads in the cropped graph image and overlays it onto the plot area, using the same axis limits as the corner coordinates:

# read in an image file
img=readPNG("testgraph-crop.png")

# paint the image onto the plot area.
rasterImage(img,0,0,18,4,interpolate=TRUE)

Now, when you click on points, the x,y values that locator() returns are already in the graph’s own units, so they should be close to the values you would read off the original axes.
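To avoid editing the axis limits in two places, the set-up steps can be wrapped in a small function. This is only a sketch: `setup_graph()` is a made-up name, and the choice of `type = "n"` (draw no dummy points) and `xaxs = "i"`, `yaxs = "i"` (no padding around the axis range, so the image fills the plot frame) is my assumption about what is convenient, not something the script above requires.

```r
# Hypothetical helper (not part of base R or the png package): set up a plot
# whose coordinates match the graph's axis ranges, then paint the cropped
# image so that it exactly covers that range
setup_graph <- function(img, xlim, ylim) {
  # type = "n" draws an empty plot; xaxs/yaxs = "i" turn off axis padding
  plot(xlim, ylim, type = "n", xlim = xlim, ylim = ylim,
       xaxs = "i", yaxs = "i", xlab = "x", ylab = "y")
  # corners are given as left, bottom, right, top in graph units
  rasterImage(img, xlim[1], ylim[1], xlim[2], ylim[2], interpolate = TRUE)
  invisible(par("usr"))  # plot-region limits as c(x1, x2, y1, y2)
}

# Typical interactive use (needs the png package and your own cropped image):
# library(png)
# setup_graph(readPNG("testgraph-crop.png"), xlim = c(0, 18), ylim = c(0, 4))
# out <- as.data.frame(locator(100, type = "p"))
```

The clicking step is left commented out because locator() only works in an interactive session.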

Let me know if you have problems.

🙂 Geoff