The Data Science Central blog pointed out an interesting discussion on Quora:

The blog post summarizes the lengthy discussion on Quora as follows:

Here’s a summary of the very long and detailed top answer:

- Learn about matrix factorizations
- Learn about distributed computing
- Learn about statistical analysis
- Learn about optimization
- Learn about machine learning
- Learn about information retrieval
- Learn about signal detection and estimation
- Master algorithms and data structures
- Practice
- Study Engineering

All the numerous other answers go along the same lines. We strongly disagree with this, in the sense that these posters miss 50% of what makes a real data scientist: business acumen, domain expertise, craftsmanship and tricks of the trade, data vision (both metaphorically and literally), leadership, communication skills, vendor selection, consulting skills, and expertise in finding data sets (not just insights) and metrics. Also, I believe matrix factorizations and some other stuff (eigenvalues) are not part of modern data science anymore. These answers by young, very smart, educated people illustrate the mismatch between what hiring managers are looking for and what potential hires think they should learn (reinforced by university curricula) to become a data scientist.

The top answer from the discussion is reproduced here for convenience. You should check out the complete discussion on Quora:

Strictly speaking, there is no such thing as “data science” (see What is data science? ). See also: Vardi, Science has only two legs: http://portal.acm.org/ft_gateway…

Here are some resources I’ve collected about working with data. I hope you find them useful (note: I’m an undergrad student, so this is not an expert opinion in any way).

1)

Learn about matrix factorizations

- Take the Computational Linear Algebra course (it is sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis, or Matrix Analysis, and it can be either a CS or an Applied Math course). Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard “machine learning” curriculum. With terabytes of data, traditional tools such as Matlab are no longer suitable for the job; you cannot just run eig() on Big Data. Distributed matrix computation packages such as those included in Apache Mahout [1] are trying to fill this void, but you need to understand how the numeric algorithms and LAPACK/BLAS routines [2][3][4][5] work in order to use them properly, adjust for special cases, build your own, and scale them up to terabytes of data on a cluster of commodity machines [6]. Numerics courses are usually built upon undergraduate algebra and calculus, so you should be fine with the prerequisites. I’d recommend these resources for self-study and reference:
- See Jack Dongarra : Courses and What are some good resources for learning about numerical analysis?
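
To make the point about eig() concrete, here is a minimal sketch of power iteration, one of the simplest eigenvalue methods. It is not taken from the original answer, and the function names and the small example matrix are illustrative assumptions. The only operation it applies to the matrix is a matrix-vector product, which is the kind of step that can be split across machines when the matrix does not fit on one.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, iterations=200, tol=1e-12):
    """Estimate the dominant eigenvalue and eigenvector of a symmetric matrix.

    Assumes the dominant eigenvalue is unique in magnitude and that the
    starting vector is not orthogonal to its eigenvector.
    """
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n  # unit-length starting guess
    eigenvalue = 0.0
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return 0.0, vector
        next_vector = [x / norm for x in product]
        # Rayleigh quotient v^T A v for the unit vector v.
        next_eigenvalue = sum(
            v * w for v, w in zip(next_vector, mat_vec(matrix, next_vector))
        )
        converged = abs(next_eigenvalue - eigenvalue) < tol
        vector, eigenvalue = next_vector, next_eigenvalue
        if converged:
            break
    return eigenvalue, vector


# Hypothetical 2x2 example; its eigenvalues are (5 +/- sqrt(5)) / 2.
A = [[2.0, 1.0], [1.0, 3.0]]
dominant_value, dominant_vector = power_iteration(A)
print(dominant_value)  # approximately 3.618
```

Production libraries use far more sophisticated algorithms than this, so treat it only as a way to see what a numerical eigenvalue routine does step by step.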
2)

Learn about distributed computing

- It is important to learn how to work with a Linux cluster and how to design scalable distributed algorithms if you want to work with big data (Why the current obsession with “big” data? ).
- The Crays and Connection Machines of the past can now be replaced with farms of cheap cloud instances; computing costs dropped to less than $1.80/GFlop in 2011 vs. $15M in 1984: http://en.wikipedia.org/wiki/FLOPS .
- If you want to squeeze the most out of your (rented) hardware it is also becoming increasingly important to be able to utilize the full power of multicore (see http://en.wikipedia.org/wiki/Moo… )
- Note: this topic is not part of a standard Machine Learning track but you can probably find courses such as Distributed Systems or Parallel Programming in your CS/EE catalog.

- See Indranil Gupta : UIUC Home Page#teaching and What are some good resources for learning about distributed computing? Why?
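
As a small illustration of what a scalable distributed algorithm can look like, the sketch below computes a mean and variance in map-reduce style. Each worker summarizes its own chunk, and the summaries are merged pairwise, so no single machine ever needs the full data set. It is not from the original answer: the function names are hypothetical, a local thread pool stands in for cluster nodes, and every chunk is assumed to be non-empty.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def summarize(chunk):
    """Map step: reduce one non-empty chunk to (count, mean, M2).

    M2 is the sum of squared deviations from the chunk mean.
    """
    count = len(chunk)
    mean = sum(chunk) / count
    m2 = sum((x - mean) ** 2 for x in chunk)
    return count, mean, m2


def merge(left, right):
    """Reduce step: combine two partial summaries into one."""
    n_a, mean_a, m2_a = left
    n_b, mean_b, m2_b = right
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def distributed_mean_variance(chunks, workers=4):
    """Summarize the chunks in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, chunks))
    count, mean, m2 = reduce(merge, partials)
    return mean, m2 / count  # population variance


# Hypothetical data: the values 1..9 split unevenly across three "nodes".
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
mean, variance = distributed_mean_variance(chunks)
print(mean, variance)
```

Because merging two summaries gives the same result regardless of how the data was split, the same two functions would work unchanged if the chunks lived on different machines.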
3)

Learn about statistical analysis

- Start learning statistics by coding with R: What are essential references for R? and experiment with real-world data: Data: Where can I find large datasets open to the public?
- Cosma Shalizi compiled some great materials on computational statistics; check out his lecture slides, and also Statistical science: What are some good resources for learning about statistical analysis?

- You’ve got to love your data. To read about the inspired use of statistics, check out Nate Silver’s book The Signal and the Noise: Why So Many Predictions Fail-but Some Don’t, where in chapter 3 he talks about PECOTA and his love affair with baseball predictions. Also see the story of W. S. Gosset applying his statistical knowledge for great good: http://www.beeronomics.org/paper…
4)

Learn about optimization

- This subject is essentially a prerequisite to understanding many Machine Learning and Signal Processing algorithms, besides being important in its own right.
- Start with Stephen P. Boyd’s video lectures and also Mathematical Optimization: What are some good resources to learn about optimization?
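
To show why optimization sits underneath so many learning algorithms, here is a minimal sketch of gradient descent fitting a straight line by least squares. It is an illustration added here, not something from the original answer. The function name, learning rate, step count, and toy data are all assumptions chosen so the example converges.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every data point.
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2.0 / n * sum(residuals)
        # Step against the gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # close to 2 and 1
```

The learning rate matters: for this data a step size much above roughly 0.15 makes the iteration diverge, which is the kind of behavior an optimization course explains.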
5)

Learn about machine learning

- Before you get to think about algorithms, look carefully at the data and select all the relevant features to include in your model. See this talk by Jeremy Howard: At Kaggle, It’s a Disadvantage To Know Too Much
- Also see What are some good resources for learning about machine learning? Why? and Large Scale Learning: What are some introductory resources for learning about large scale machine learning? Why?

- Statistics vs. machine learning, fight!: http://brenocon.com/blog/2008/12…
- You can structure your study program according to online course catalogs and curricula of MIT, Stanford or other top schools. Experiment with data a lot, hack some code, ask questions, talk to good people, set up a web crawler in your garage: The Anatomy of a Search Engine
- You can join one of these startups and learn by doing: What startups are hiring engineers with strengths in machine learning/NLP?
- The alternative (and rather expensive) option is to enroll in a CS program/Machine Learning track if you prefer studying in a formal setting. See: Graduate School: What makes a Master’s in Computer Science (MS CS) degree worth it and why?
- Try to avoid overspecialization. The breadth-first approach often works best when learning a new field and dealing with hard problems; see the Second voyage of HMS Beagle on the adventures of an ingenious young data miner.
6)

Learn about information retrieval

- Machine learning is not as cool as it sounds: http://teddziuba.com/2008/05/mac…
- Information Retrieval: What are some good resources to get started with Information Retrieval? Why?
7)

Learn about signal detection and estimation

- This is a classic topic and “data science” par excellence in my opinion. Some of these methods were used to guide the Apollo mission or detect enemy submarines and are still in active use in many fields. This is often part of the EE curriculum.
- Start with Robert F. Stengel’s lecture slides on optimal control and estimation: Rob Stengel’s Home Page and Alan V. Oppenheim’s mit.edu
- What are some good resources for learning about signal estimation and detection?
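
For a flavor of what estimation means in practice, here is a minimal sketch of a one-dimensional Kalman filter that estimates a constant value from noisy measurements. It is an illustration added here and is not drawn from the lecture materials mentioned above. The function name, parameter defaults, and sample readings are assumptions.

```python
def kalman_constant(
    measurements,
    measurement_variance,
    initial_estimate=0.0,
    initial_variance=1e6,
    process_variance=0.0,
):
    """Scalar Kalman filter for a (nearly) constant hidden value.

    Returns the final estimate and its variance.
    """
    estimate, variance = initial_estimate, initial_variance
    for z in measurements:
        # Predict: the value is modeled as constant, so only uncertainty grows.
        variance += process_variance
        # Update: blend prediction and measurement by their relative certainty.
        gain = variance / (variance + measurement_variance)
        estimate += gain * (z - estimate)
        variance *= 1.0 - gain
    return estimate, variance


# Hypothetical sensor readings of a quantity near 5, with unit noise variance.
readings = [4.8, 5.3, 4.9, 5.1, 5.0]
estimate, variance = kalman_constant(readings, measurement_variance=1.0)
print(estimate, variance)
```

With no process noise and a very uncertain starting guess, the filter ends up essentially at the average of the readings, and its variance shrinks as more measurements arrive. Setting process_variance above zero lets the estimate track a value that drifts over time.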
8)

Master algorithms and data structures
9)

Practice

- Carpentry: http://software-carpentry.org/
- Programming Challenges: What are some good “toy problems” in data science?
- Tools: Which are some of the best Data Analysis tools?
- Data: Where can I find large datasets open to the public?
If you do decide to go for a Master’s degree:

10)

Study Engineering

I’d go for CS with a focus on either IR or Machine Learning, or a combination of both, and take some systems courses along the way. As a “data scientist” you will have to write a ton of code and probably develop distributed algorithms/systems to process massive amounts of data. An MS in Statistics will teach you how to do modeling, regression analysis, etc., not how to build systems. I think the latter is more urgently needed these days as the old tools become obsolete with the avalanche of data. There is a shortage of engineers who can build a data mining system from the ground up. You can pick up statistics from books and experiments with R (see item 3 above) or take some statistics classes as a part of your CS studies.

Good luck.

[1] http://mahout.apache.org/

[2] http://www.netlib.org/lapack/

[3] http://www.netlib.org/eispack/

[4] http://math.nist.gov/javanumeric…

[5] http://www.netlib.org/scalapack/

[6] http://labs.google.com/papers/ma…
