Psst! Wanna look at some COVID-19 data?

Maybe you’ve heard enough about this lately. But if you’re trying to learn data science, whether with Python or R, this is a chance to do it with something very timely.

The European Centre for Disease Prevention and Control publishes new data daily here in CSV, XML and JSON formats. It also includes some R code for reading it.

This site has some basic Python code to help you get the data and put it into a data frame with pandas. If you don’t know any pandas, it’s a great way to work with tabular data in Python.
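If you want a feel for what that looks like, here is a minimal sketch. It is not the code from the site above. The sample rows, the column names and the `load_cases` helper are all made up for illustration, so check the real file's columns before you rely on them. The only real library call doing the work is `pandas.read_csv`, which accepts a local path, a URL or a file-like object.

```python
import io

import pandas as pd

# Made-up sample rows standing in for a downloaded file. The column names
# are assumptions for this sketch, not the publisher's actual schema.
SAMPLE_CSV = """date,country,cases,deaths
2020-03-01,Aland,5,0
2020-03-02,Aland,9,1
2020-03-01,Borduria,2,0
2020-03-02,Borduria,4,0
"""


def load_cases(source):
    """Read CSV from a path, URL or file-like object into a DataFrame."""
    # parse_dates turns the text in the "date" column into real timestamps.
    return pd.read_csv(source, parse_dates=["date"])


# Here we read from an in-memory string so the example runs offline.
df = load_cases(io.StringIO(SAMPLE_CSV))

# Total cases per country, largest first.
totals = df.groupby("country")["cases"].sum().sort_values(ascending=False)
print(totals)
```

To try it on real data, pass the path of the CSV file you downloaded to `load_cases` in place of the in-memory sample, and adjust the column names to match.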

So knock yourself out, and let me know if you find something interesting.

COVID-19 model with source

It’s true that all models are wrong, but some are useful. If nothing else about this model is useful, you can see an example of what can be done with R and Shiny.

Click around on the site and you’ll even find a GitHub link to the source code. Alright, here it is:

More cheap books, from Packt

Packt is offering two ways to get your cheap book fix.

Mega-Bundles consist of books selected around a theme, 15 for $50.

Build your own bundles are 10 books for $40. There is a great variety to choose from, including everything this blog discusses and much more besides.

Some of the books are shortish – 100 pages or so. You can see the page counts for the individual books. Of course length and quality are two different things.

Time may be short, so check these out soon. If you recommend particular books, leave a comment.

Free “Jumpstart with R” course (but not for long)

There are many sources of coursework for R and other data science topics. I haven’t tried Business Science, but it’s hard to resist free. Even if you already know the material, it can be useful to have it presented from another perspective with different examples, especially if you don’t use it often and tend to get rusty.

Yeah, they’ll pitch paid products at you – they *do* have to make a living.

So check it out here.

No Starch Press books in new Humble Bundle

Get a bunch of ebooks cheap, in PDF, .mobi and .epub formats. But hustle – the promotion ends about 20 days from now, just before Christmas.

The books cover R, Python, JavaScript, SQL – even F# and Haskell for you functional programming lovers. One promises Bayesian statistics “the fun way” with Star Wars, rubber ducks and LEGO – how can you possibly resist that? And for those of you who don’t foul up statistics enough on your own, there’s “Statistics Done Wrong” to help you out.

No ereader, no problem! If you’re reading this you can get one fast. The books are available as PDFs so you’re definitely covered on any machine that can read a PDF, and that’s about anything. The free Kindle app will handle the books too, with more conveniences and less obnoxious scrolling.

I usually read my ebooks on a Kindle or Nook. This gives me a nice legible screen right next to my main monitor to help me work. Kindles read .mobi format; Nooks like .epub. Not sure about iPads or Google devices – haven’t tried them.

You might have to jump through some hoops to get the books loaded on the ereaders, but it can be done. I’ve done it by “sideloading”, and Kindles will let you email books below a certain size to your Kindle email account, assuming you have one.

An app like Calibre will let you convert formats and also serve as a reader on a PC or Mac, maybe even on Linux. This also makes it easier to grab code straight from the books. Yep, it’s free.

No excuses – get started!

Practice, practice!

Knowing language syntax is one thing. Really knowing a language means taking on challenges. But what challenges?

Here are some ideas. Presuming that you’re interested in data science, you need to be good at procuring, preparing and analyzing data. Let’s look at the procuring part.

What to work on?

Do you know your “why”? Are you looking at data science because it’s been called “sexy”? Do you just want to make a living? Or are you really animated by the possibilities? Know yourself – don’t start a career that might make you miserable just because it is hyped.

Let’s see if any of the following interest you.


It’s baseball season – how about some sabermetrics? Yes, baseball statistics are so well studied they even have their own name, and there are books and movies about it.

Maybe you’re a football fan (not soccer, or metric football as some of us call it). How about some player stats for your fantasy league? You can bet that NFL teams are using statistics to improve their results.

Alright soccer fans, here’s something for you. There may well be far better sources – I just found these from some casual googling with “download <sport> stats”.

Social issues

Maybe you’re more interested in human welfare and poverty. How about Gapminder? The late Hans Rosling did some terrific work, including his talk “The best stats you’ve ever seen”. His recent book is well regarded by some very influential people.

What about crime?

More sites: government data, tuberculosis, the Centers for Disease Control and Prevention, the Guardian

All of the above are free currently. If you can pay, or meet various eligibility criteria as a legitimate researcher, many more are available.

Finance, entertainment…

Some financial information on Quandl is free.

Do you like movies? Here are movie reviews from Amazon.

I wonder – what did Facebook use to train its system to recognize pictures to filter out?


Other sites have assembled lists of good data sources. The most extensive one may well be at KDNuggets, which is terrific for all sorts of data science issues and is a permanent link here.

This post could go on forever, and eventually there will be a dedicated data page here on this site. But the point of this particular post is to get you something to work on to develop your coding and analysis skills. So let’s work on it.

What to do with the data?

A lot of data science work is nothing but reading, cleaning and manipulating data. You might not know what to do with data yet, but you can prep it for the people who do. Get good at this and you can apprentice with the people who do the advanced analyses. Along the way you can develop your SAS, R, Python, command line and other coding skills.

Specifically, you want to know how to:

  • Find and download data sources.
  • Read whatever data you find in whatever form.
  • Automate these processes and deal with problems that come up.
  • Process, filter and join the read data into forms tidy enough to support further analysis.
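The steps above can be sketched in a few lines of Python using only the standard library. Treat this as one possible shape for the work, not a recipe: the sample data, the country names and every helper function (`fetch_text`, `read_csv_rows`, `clean`, `join_on_country`) are hypothetical. The download helper is defined but never called, so the example runs offline.

```python
import csv
import io
import json
import time
import urllib.error
import urllib.request

# Made-up sample data standing in for two downloaded files.
CASES_CSV = """country,cases
Aland,14
Borduria,6
Carpania,
"""
POPULATION_JSON = (
    '[{"country": "Aland", "population": 1000},'
    ' {"country": "Borduria", "population": 3000}]'
)


def fetch_text(url, retries=3, wait_seconds=2):
    """Download a URL as text, retrying a few times on network errors."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read().decode("utf-8")
        except urllib.error.URLError:
            if attempt == retries:
                raise
            time.sleep(wait_seconds)


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with a missing case count and convert counts to int."""
    cleaned = []
    for row in rows:
        if row["cases"].strip() == "":
            continue
        cleaned.append({"country": row["country"], "cases": int(row["cases"])})
    return cleaned


def join_on_country(cases, populations):
    """Inner join: keep only countries present in both sources."""
    pop_by_country = {p["country"]: p["population"] for p in populations}
    return [
        {**row, "population": pop_by_country[row["country"]]}
        for row in cases
        if row["country"] in pop_by_country
    ]


cases = clean(read_csv_rows(CASES_CSV))
populations = json.loads(POPULATION_JSON)
tidy = join_on_country(cases, populations)
for row in tidy:
    print(row)
```

Real data will be messier than this, which is the point of practicing. Once the plain-Python version makes sense, try the same read, clean and join steps in pandas or R and compare how much code each takes.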

If you don’t have your own ideas…

Here are courses in R, Python and SAS from Coursera that can help.

To learn and practice R, try the Johns Hopkins data science program via Coursera. You’ll be installing and learning R and also learning other skills you’ll be using regularly.

For Python, check out this program from UCSD via Coursera. It assumes that you already know a little Python – if you don’t, look here. It uses Python 3.

For SAS, Coursera offers a class for beginners and one with more advanced statistics. You access SAS either by setting up a virtual machine (requiring a local installation) or by using the SAS Academic Edition (via a browser). The courses are here. These are not as extensive as the R and Python courses above, but SAS has only recently begun on Coursera and I think more is coming.

Last I knew, you could take the courses above for free, but expect to pay if you want documented certifications and grading. Incidentally, I have no commercial tie to Coursera and in fact pay for their services (although they’re welcome to give promotional consideration…). There are other sources; I’m just not as familiar with them.

Enough reading – let’s practice our code!