Build an e-commerce plot in R

Recently I did some research on the state of e-commerce in Poland, which was necessary for a position I applied for.  While the project did not require me to present any statistical or numeric data, I figured it would be nice to attach a simple plot portraying how the field has changed over the past several years.  The following is a record of my struggle.

1. Find data

You need to draw a plot about the development of e-commerce in a country, and you need to do it fast.  You glide through Alexa and other sites gathering web traffic data, frantically looking for an option to export data on individual domains as a CSV file.  After a while, with some resignation, you turn to Google Trends as your possible saviour.  With a few clicks of the mouse, you export individual files covering searches for the five major internet shops in the game.

You look at the files.  You merge the different pieces of data.  You convert integers to integers and dates to dates.  You change the format from wide to long with tidyr.  You draw a plot…
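If you wanted to retrace those steps in R, a minimal sketch might look like the one below.  It is not the code behind the original plot: the two data frames stand in for per-shop Google Trends exports, and the shop names, column names and values are all made up for illustration.  The reshape step turns one column per shop into one row per shop and month.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-ins for two per-shop Google Trends exports;
# the shop names and values are invented for illustration.
shop_a <- data.frame(month = c("2015-01", "2015-02", "2015-03"),
                     shop_a = c("40", "55", "100"),
                     stringsAsFactors = FALSE)
shop_b <- data.frame(month = c("2015-01", "2015-02", "2015-03"),
                     shop_b = c("100", "80", "60"),
                     stringsAsFactors = FALSE)

# Merge the pieces on the shared date column.
trends <- merge(shop_a, shop_b, by = "month")

# Exports often arrive as text: make integers integers and dates dates.
trends$shop_a <- as.integer(trends$shop_a)
trends$shop_b <- as.integer(trends$shop_b)
trends$month  <- as.Date(paste0(trends$month, "-01"))

# Reshape so that each row holds one shop in one month.
trends_long <- pivot_longer(trends, cols = -month,
                            names_to = "shop", values_to = "interest")

# One line per shop.
p <- ggplot(trends_long, aes(x = month, y = interest, colour = shop)) +
  geom_line()
```

Printing `p` draws the plot; `trends_long` has one row for every shop and month combination.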

Figure: E-shop popularity in Poland, according to Google data

It is beautiful.  You cry as if your child had just been born.  You go back to Google Trends to confirm that your plot resembles what is presented there.  Well, it does not.

2. Realize your data is wrong. Cry.

Google Trends, when you check keywords one at a time, reports each keyword’s search interest relative to that keyword’s own popularity, with the maximum value, 100, marking that keyword’s own peak.  Had you exported a merged file comparing the popularity of all the keywords, your plot would show high popularity for the top one, and the rest would remain close to zero.  You could redraw the plot with the corrected data, but it would be useless: the only thing it would convey is the popularity of the most popular platform.  You despair.
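The arithmetic behind the trap can be sketched in a few lines of R.  This is a simplified model of the scaling, not Google’s actual computation, and the two shops and their search volumes are invented for the example.

```r
# Made-up monthly search volumes for two hypothetical shops.
raw <- data.frame(big_shop   = c(5000, 8000, 10000),
                  small_shop = c(100, 300, 200))

# Scale values to a 0-100 range against a chosen maximum.
scale_to_100 <- function(x, max_value) round(100 * x / max_value)

# Exported one by one: each keyword is scaled to its own peak.
separately <- as.data.frame(
  lapply(raw, function(x) scale_to_100(x, max(x)))
)

# Exported together: every keyword is scaled to the overall peak.
overall_peak <- max(unlist(raw))
together <- as.data.frame(
  lapply(raw, function(x) scale_to_100(x, overall_peak))
)
```

In `separately`, both shops reach 100 in their best month, so they look equally popular.  In `together`, the smaller shop never rises above 3, which is why the merged plot tells you little beyond who the leader is.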

3. Despair. Look for more data.

You frantically search through Alexa and other sites gathering web traffic data again.  You stumble upon a website gathering data on local e-commerce.  You notice they publish a ranking of the most popular e-commerce websites in the country you are researching.  They publish it every month.

You realize that you will need to copy the relevant data from several years of monthly reports.  You cry more.  Then you start writing it up in a CSV file.  You choose four domains which appear in the ranking fairly regularly, and track them down in the reports, month after month.  You notice that there are no reports for several months in 2016, but it is too late to worry about it.

4. Analyze the data. Again.

You have the data, and now it is arranged exactly the way you needed. You can plot it almost immediately.

You have a graph.  It doesn’t look perfect.  You included both lines and points to make it obvious that data is missing for several months in 2016; otherwise it just looks strange.  You put it in the report, and hope it works.
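For the curious, here is a rough sketch of the lines-plus-points idea in ggplot2.  It is not the original code: the domains, the months, the `score` column and every value are assumptions made up for the example, with March to May 2016 left out to imitate the missing reports.

```r
library(ggplot2)

# Hypothetical hand-typed ranking data, already in long format;
# domains, column names and values are invented for illustration.
ranking <- read.csv(text = "month,domain,score
2016-01-01,shop-a.example,40
2016-02-01,shop-a.example,42
2016-06-01,shop-a.example,55
2016-07-01,shop-a.example,57
2016-01-01,shop-b.example,30
2016-02-01,shop-b.example,28
2016-06-01,shop-b.example,35
2016-07-01,shop-b.example,36", stringsAsFactors = FALSE)

# Dates typed by hand come in as text; convert them to real dates.
ranking$month <- as.Date(ranking$month)

p <- ggplot(ranking, aes(x = month, y = score, colour = domain)) +
  geom_line() +   # joins consecutive observations, straight across the gap
  geom_point()    # marks only the months that actually have a report
```

With the line alone, the stretch from February to June reads as a smooth trend.  The points show that nothing was actually measured in between.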

Figure: Popularity of shopping platforms


Hey, so you read this far!  Thank you for bearing with my literary experiment, and sorry if it was annoying.  I wanted to upload all the data used in this post to my GitHub account, but a significant portion of it is in Polish (you can see it in the code, in the parts where I refer to column names).  Anyway, I bet you can find a better source of traffic data that would work for you.

Subscribe if you’d like to read more about my fantastic adventures in the world of Pandas and R!
