Knowing language syntax is one thing. Really knowing a language means taking on challenges. But what challenges?
Here are some ideas. Presuming that you’re interested in data science, you need to be good at procuring, preparing and analyzing data. Let’s look at the procuring part.
What to work on?
Do you know your “why”? Are you looking at data science because it’s been called “sexy”? Do you just want to make a living? Or are you really animated by the possibilities? Know yourself – don’t start a career that might make you miserable just because it is hyped.
Let’s see if any of the following interest you.
Maybe you’re a football fan (not soccer, or metric football as some of us call it). How about some player stats for your fantasy league? You can bet that NFL teams are using statistics to improve their results.
Alright soccer fans, here’s something for you. There may well be far better sources – I just found these from some casual googling with “download <sport> stats”.
Maybe you’re more interested in human welfare and poverty. How about Gapminder? The late Hans Rosling did some terrific work, like his talk “The best stats you’ve ever seen”. His recent book is well regarded by some very influential people.
What about crime?
All of the above are free currently. If you can pay, or meet various eligibility criteria as a legitimate researcher, many more are available.
Some financial information on Quandl is free.
Do you like movies? Here are movie reviews from Amazon.
I wonder – what did Facebook use to train its system to recognize pictures to filter out?
Other sites have assembled lists of good data sources. The most extensive one may well be at KDNuggets, a terrific resource for all sorts of data science topics and a permanent link on this site.
This post could go on forever, and eventually there will be a dedicated data page here on this site. But the point of this particular post is to get you something to work on to develop your coding and analysis skills. So let’s work on it.
What to do with the data?
A lot of data science work is nothing but reading, cleaning and manipulating data. You might not know what to do with data yet, but you can prep it for the people who do. Get good at this and you can apprentice with the people who run the advanced analyses. Along the way you can develop your SAS, R, Python, command line and other coding skills.
Specifically, you want to know how to:
- Find and download data sources.
- Read whatever data you find in whatever form.
- Automate these processes and deal with problems that come up.
- Process, filter and join the data you have read into forms tidy enough to support further analysis.
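To make those four steps a little more concrete, here is a minimal Python sketch using only the standard library. Treat it as an illustration, not a recipe: the player and team figures are invented, and the function names (fetch_text, read_rows, join_rows) are my own placeholders rather than anything from a library. The download helper is defined but not called, since it needs the URL of whatever file you actually find.

```python
import csv
import io
import time
import urllib.request


def fetch_text(url, retries=3, pause=2.0):
    """Download a URL as text, retrying a few times on network errors."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read().decode("utf-8")
        except OSError:  # URLError and timeouts are subclasses of OSError
            if attempt == retries:
                raise
            time.sleep(pause)


def read_rows(csv_text):
    """Parse CSV text into a list of dicts, stripping stray whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key.strip(): (value or "").strip()
         for key, value in row.items() if key is not None}
        for row in reader
    ]


def join_rows(left, right, key):
    """Inner-join two lists of dicts on a shared key column."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


# Invented sample data standing in for two downloaded files.
PLAYERS_CSV = """player, team, yards
Ann Example, RED, 1200
Bob Sample, BLU,
Cy Placeholder, RED, 845
"""

TEAMS_CSV = """team,city
RED,Springfield
BLU,Shelbyville
"""

players = read_rows(PLAYERS_CSV)
teams = read_rows(TEAMS_CSV)

# Filter: drop rows with no yardage, and convert the rest to integers.
complete = [dict(row, yards=int(row["yards"])) for row in players if row["yards"]]

# Join: attach each player's home city from the teams table.
tidy = join_rows(complete, teams, "team")

for row in tidy:
    print(row)
```

In real work you would pass the result of fetch_text(url) to read_rows instead of the inline strings. Libraries such as pandas do all of this in fewer lines, but writing it by hand once is good practice.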
If you don’t have your own ideas…
Here are courses in R, Python and SAS from Coursera that can help.
To learn and practice R, try the Johns Hopkins data science program via Coursera. You’ll be installing and learning R and also learning other skills you’ll be using regularly.
For SAS, Coursera offers a class for beginners and one with more advanced statistics. You access SAS either by setting up a virtual machine (requiring a local installation) or by using the SAS Academic Edition (via a browser). The courses are here. These are not as extensive as the R and Python courses above, but SAS has only recently begun on Coursera and I think more is coming.
Last I knew, you could take the courses above for free, but expect to pay if you want documented certifications and grading. Incidentally, I have no commercial tie to Coursera and in fact pay for their services (although they’re welcome to give promotional consideration…). There are other sources; I’m just not as familiar with them.
Enough reading – let’s practice our code!