Google Track

Tuesday, September 17, 2013

The Data Science Mindset

Intro 

Names like ‘R’, ‘SQL’, and ‘D3’ make data science seem more like alphabet soup than a deliberate practice of working with data. It’s so easy to get lost in the sea of acronyms, packages, and frameworks that we often find our students prematurely optimizing for the right toolset to use, unable to move forward until they have researched every available option. In reality, data science isn’t just about the tools. It’s a mindset: a way of looking at the world. It’s about taking advantage of our modern computers and all of the information that they’re already collecting to study how things work and push the limits of human knowledge just a little bit further. We have a favorite saying around here — data is everything and everything is data. If we begin with this mindset, a lot of data science approaches naturally follow.
 

Store Everything

Storage is cheap. Collect everything and ask questions later. Store it in the rawest form that is convenient, and don’t worry about how or even when you’re going to analyze it. That part comes later.

Use Existing Data

We’re already storing data — let’s use it. When faced with questions, data scientists regularly adapt the query so that it can be approximately answered with an existing and convenient dataset. The best part of data science is discovering surprising applications of existing stores of data. For example, there is a plethora of satellite imagery of Earth. We can use this data to learn about fertilizer use in Uganda, or use pictures of the Earth at night to estimate rural electrification in developing countries.

Connect Datasets

We’re storing everything, all over the world, inexpensively, for the first time in history. There are many lessons to be learned by utilizing more of this treasure trove. Don’t worry about making the best use of a single source of data. Focus on connecting disparate datasets rather than tuning your models. Conventional statistics teaches a lot about how to choose analysis methods that are appropriate for your data collection approach and how to tune the models for a specific dataset.
Effective data science, by contrast, is about using a range of datasets and connecting the dots between one set of data and another, such as predicting restaurant health scores based on Yelp reviews. In machine learning speak: it’s often better to collect more features than to spend days optimizing hyperparameters.
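As a rough illustration of connecting two datasets, here is a minimal Python sketch that joins inspection records to review records on a shared key. The records, the field names (`business_id`, `health_score`, `stars`), and the `join_on_key` helper are all made up for this example; real inspection and review data would need far messier matching.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, invented for illustration only.
inspections = [
    {"business_id": "b1", "health_score": 92},
    {"business_id": "b2", "health_score": 71},
    {"business_id": "b3", "health_score": 85},
]
reviews = [
    {"business_id": "b1", "stars": 5},
    {"business_id": "b1", "stars": 4},
    {"business_id": "b2", "stars": 2},
    {"business_id": "b4", "stars": 3},
]


def join_on_key(inspections, reviews, key="business_id"):
    """Attach review features to each inspection record that has reviews."""
    # Group star ratings by business so each lookup is a dictionary access.
    stars_by_business = defaultdict(list)
    for review in reviews:
        stars_by_business[review[key]].append(review["stars"])

    joined = []
    for record in inspections:
        stars = stars_by_business.get(record[key])
        if not stars:
            continue  # no reviews for this business, so nothing to connect
        joined.append(
            {**record, "review_count": len(stars), "mean_stars": mean(stars)}
        )
    return joined


joined = join_on_key(inspections, reviews)
for row in joined:
    print(row)
```

Each joined row now carries features from both sources, which is the kind of table a model could be fit on later.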

Anything Can Be Quantified

Our culture loves to quantify. If you can turn something into a number, that number can be put into a table. Importantly, that table can now be processed by a computer.
A spreadsheet about sewer overflows is clearly data to most people, but what about a calendar? At first, a calendar might not seem like the sort of data that you analyze with statistics. However, you can also represent a calendar as a spreadsheet and as a graph.
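To make the calendar example concrete, here is one possible way to flatten a month into spreadsheet-style rows using only Python’s standard library. The column names (`date`, `weekday`, `is_weekend`) and the `month_as_rows` helper are assumptions chosen for illustration, not a standard layout.

```python
import calendar
import csv
import io

# Fixed English names so the output does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def month_as_rows(year, month):
    """Return one row per day of the month: ISO date, weekday, weekend flag."""
    rows = []
    for day in calendar.Calendar().itermonthdates(year, month):
        if day.month != month:
            continue  # itermonthdates pads with days from adjacent months
        rows.append({
            "date": day.isoformat(),
            "weekday": WEEKDAYS[day.weekday()],
            "is_weekend": int(day.weekday() >= 5),
        })
    return rows


rows = month_as_rows(2013, 9)

# Write the rows as CSV text, the same shape a spreadsheet would hold.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "weekday", "is_weekend"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Once the calendar is a table, questions such as “how many weekend days are in this month?” become a simple sum over a column.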

Data science becomes a creative endeavor when you peel away the obvious variables presented to you. Maybe you have a bunch of PDF documents. You could easily extract the text in the PDFs and search through the content. Depending on the problem you are solving, these files may hold more interesting information than just the text. You can get the page count, the file size, the shapes of the pages, and the program that created each file. There is information hidden in many datasets that goes beyond what’s immediately obvious.
There is a lot of talk about the difference between different kinds of data. There’s “qualitative” vs. “quantitative” and “unstructured” vs. “structured.” To me, there isn’t much difference between “qualitative” and “quantitative” data, nor is there between “unstructured” and “structured” data because I know that I can convert between the different types.
At first, the registration papers of a company might not seem like interesting data. They begin as paper, most of the fields are text, and the formats aren’t particularly standardized. But when you put them in a database in a machine-readable format, qualitative data becomes quantitative data that can be used to supplement other data sources.

Send Boring Work to Robots

We no longer live in an era where “computer” refers to someone who carries out calculations. Find yourself doing something over and over? Give it to the bots. As far as data analysis goes, modern computers can be far more effective at rote tasks, such as drawing new graphs with every update of a dataset.
Data collection is a prime example of a task that should be automated. A common scene in university research labs is swaths of grad students handing out paper questionnaires to study participants. The data scientist says: collect the data automatically and unobtrusively, using existing systems whenever possible. The supercomputers we carry in our pockets are a great place to start.
This mindset can be applied not only to the data, but also to the process itself. Rather than learning and remembering your entire analysis process, you can write a program that does the whole thing for you, from the original acquisition of the data, to the modeling, to the presentation of results to another person. By making everything a program, you make it easier to find mistakes, to update your analyses, and to reproduce your results.
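A toy version of “everything is a program” might look like the sketch below, which chains acquisition, modeling, and presentation into one function. The inline `RAW_CSV` data, the function names, and the choice of a straight-line fit are all placeholders for illustration; a real pipeline would pull from an actual source and use whatever model suits the question.

```python
import csv
import io

# Placeholder data standing in for a real download or database query.
RAW_CSV = """x,y
1,2
2,4
3,6
4,8
"""


def acquire(source):
    """Parse CSV text into a list of (x, y) float pairs."""
    reader = csv.DictReader(io.StringIO(source))
    return [(float(row["x"]), float(row["y"])) for row in reader]


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def present(slope, intercept, n):
    """Format the result as a one-line report for a reader."""
    return f"Fitted y = {slope:.2f}x + {intercept:.2f} from {n} rows"


def run_pipeline(source):
    """Run every step in order so the whole analysis can be repeated."""
    points = acquire(source)
    slope, intercept = fit_line(points)
    return present(slope, intercept, len(points))


print(run_pipeline(RAW_CSV))
```

Because each step is a function, rerunning the analysis on updated data is a single call, and a mistake in any step can be found and fixed in one place.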

Tools

Once you are inside the data science mindset, solving interesting problems becomes a function of data acquisition and processing. Computers can fit models and make predictions about datasets that are too big to wrap your head around, and they can convert paper documents into electronic tables. They probably know more about you and your habits than you know yourself! Use the tools available to you, but don’t get caught up on the tools themselves.
Properly discussing these tools would take another post (maybe a book), but here’s one thought. While it always helps to have more education, you don’t need a PhD in math or computer science in order to create useful things. Loads of wonderful algorithms have already been implemented for you, and simple algorithms often work quite well. If you’re just getting started, focus on the “plumbing” that connects different datasets and systems together.

Data Science Mindset at Zipfian Academy

Our course teaches many data science tools, but we also teach the data science mindset, because you need both to be a great data scientist. To this end, we organize our 12-week course by projects, such as a recommendation engine or spam filter, rather than by software packages or algorithms. We teach the various tools in the context of applied projects so students learn how to choose the appropriate tool and how to build the plumbing that connects them.
In the end, it’s not about the newest, trendiest framework or fastest data analysis platform. It’s about finding interesting insights in your data and sharing them with the world. Start small, get your hands dirty, and have fun!