Intro
Store Everything
Use Existing Data
Connect Datasets
Effective data science draws on a range of datasets and connects the dots between one set of data and another, such as predicting restaurant health scores from Yelp reviews. In machine learning speak: it’s often better to collect more features than to spend days optimizing hyperparameters.
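To make "connecting the dots" concrete, here is a minimal sketch of joining two datasets on a shared key. The records, field names, and the `join_reviews_to_inspections` helper are all invented for illustration; real inspection and review data would need cleaning and a more careful way to match restaurants.

```python
# Hypothetical records; real data would come from inspection and review exports.
inspections = [
    {"restaurant_id": "r1", "health_score": 92},
    {"restaurant_id": "r2", "health_score": 71},
    {"restaurant_id": "r3", "health_score": 85},
]
reviews = [
    {"restaurant_id": "r1", "text": "Spotless kitchen, great food"},
    {"restaurant_id": "r2", "text": "Dirty tables and a bad smell"},
    {"restaurant_id": "r2", "text": "Found a hair in my soup"},
]


def join_reviews_to_inspections(inspections, reviews):
    """Attach each restaurant's review texts to its inspection record."""
    # Group review texts by restaurant so the join is a dictionary lookup.
    texts_by_id = {}
    for review in reviews:
        texts_by_id.setdefault(review["restaurant_id"], []).append(review["text"])

    joined = []
    for row in inspections:
        texts = texts_by_id.get(row["restaurant_id"], [])
        # Keep restaurants with no reviews; they simply get a count of zero.
        joined.append({**row, "review_count": len(texts), "reviews": texts})
    return joined


joined = join_reviews_to_inspections(inspections, reviews)
```

Each joined row now carries the health score alongside new candidate features (here just a review count), which is the "more features" idea in its smallest form.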
Anything Can Be Quantified
A spreadsheet about sewer overflows is clearly data to most people, but what about a calendar? At first, a calendar might not seem like the sort of data that you analyze with statistics. However, you can also represent a calendar as a spreadsheet and as a graph.
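As a rough sketch of that idea, the same calendar can be turned into spreadsheet-style rows and into a graph of who meets with whom. The events, names, and fields below are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical calendar entries.
events = [
    {"title": "Standup", "start_hour": 9, "attendees": ["ana", "bo", "cy"]},
    {"title": "Design review", "start_hour": 14, "attendees": ["ana", "cy"]},
    {"title": "1:1", "start_hour": 16, "attendees": ["bo", "cy"]},
]

# Spreadsheet view: one row per event, one column per attribute.
rows = [(e["title"], e["start_hour"], len(e["attendees"])) for e in events]

# Graph view: people are nodes, and an edge's weight is the number
# of meetings that two people share.
edge_weights = defaultdict(int)
for e in events:
    for a, b in combinations(sorted(e["attendees"]), 2):
        edge_weights[(a, b)] += 1
```

The rows are ready for ordinary statistics (meetings per hour, average size), while the edge weights could feed a network analysis of who collaborates most.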
Data science becomes a creative endeavor when you peel away the obvious variables presented to you. Maybe you have a bunch of PDF documents. You could easily extract the text in the PDFs and search through the content. Depending on the problem you are solving, these files may hold more interesting information than just the text. You can get the page count, the file size, the dimensions of the pages, and the program that created each file. There is information hidden in many datasets that goes beyond what’s immediately obvious.
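Some of these hidden variables need nothing more than the file system. The sketch below collects a few metadata features using only the Python standard library; the `file_features` helper is hypothetical, and it does not parse PDFs at all, so page counts, page dimensions, and the creating program would need a PDF library on top of this.

```python
import tempfile
from pathlib import Path


def file_features(path):
    """Return simple metadata features for one file (no PDF parsing involved)."""
    p = Path(path)
    info = p.stat()
    return {
        "name": p.name,
        "extension": p.suffix.lower(),
        "size_bytes": info.st_size,
        "modified_timestamp": info.st_mtime,
    }


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.PDF"
    sample.write_bytes(b"%PDF-1.4 placeholder")  # not a valid PDF, just bytes
    features = file_features(sample)
```

Run over a folder of documents, a function like this gives you one row per file, which you can then analyze like any other table.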
There is a lot of talk about the differences between kinds of data. There’s “qualitative” vs. “quantitative” and “unstructured” vs. “structured.” To me, there isn’t much difference between “qualitative” and “quantitative” data, nor between “unstructured” and “structured” data, because I know that I can convert between the different types.
At first, the registration papers of a company might not seem like interesting data. They begin as paper, most of the fields are text, and the formats aren’t particularly standardized. But when you put them in a database in a machine-readable format, qualitative data becomes quantitative data that can be used to supplement other data sources.
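Here is a small sketch of that conversion, assuming the registration papers have already been transcribed into text fields. The companies, field names, and the `to_numeric` helper are invented for illustration; real registries vary widely in format.

```python
from datetime import date

# Hypothetical transcribed registration records; every field is text.
registrations = [
    {"name": "Acme Widgets LLC", "registered_on": "2009-03-15",
     "directors": "J. Doe; R. Roe"},
    {"name": "Blue Sky Holdings Inc", "registered_on": "2016-11-02",
     "directors": "A. Smith"},
]


def to_numeric(record, as_of):
    """Turn the text fields of one registration into numeric features."""
    registered = date.fromisoformat(record["registered_on"])
    directors = [d.strip() for d in record["directors"].split(";") if d.strip()]
    return {
        "name": record["name"],
        "age_days": (as_of - registered).days,
        "director_count": len(directors),
        # Crude flag based on the name suffix; an assumption, not a legal test.
        "is_llc": int(record["name"].upper().endswith("LLC")),
    }


features = [to_numeric(r, as_of=date(2020, 1, 1)) for r in registrations]
```

Once the text has become numbers like company age and director count, it can be joined to other sources and fed into the same models as any "quantitative" dataset.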
Send Boring Work to Robots
Data collection is a prime example of a task that should be automated. A common scene in university research labs is swaths of grad students handing out paper questionnaires to study participants. The data scientist says: collect the data automatically and unobtrusively, using existing systems whenever possible. The supercomputers we carry in our pockets are a great place to start.
This mindset can be applied not only to the data, but also to the process itself. Rather than learning and remembering your entire analysis process, you can write a program that does the whole thing for you, from the original acquisition of the data, to the modeling, to the presentation of results to another person. By making everything a program, you make it easier to find mistakes, to update your analyses, and to reproduce your results.
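A toy version of "everything is a program" might look like the sketch below: one function per stage and one function that runs them end to end. The inline CSV, the column names, and the function names are placeholders; a real pipeline would fetch data from an actual source and fit a real model instead of a summary.

```python
import csv
import io
import statistics

# Stand-in for data you would normally download or query.
RAW_CSV = """day,overflows
mon,3
tue,5
wed,4
"""


def acquire(source=RAW_CSV):
    """Acquisition: parse the raw text into typed records."""
    reader = csv.DictReader(io.StringIO(source))
    return [{"day": r["day"], "overflows": int(r["overflows"])} for r in reader]


def model(rows):
    """Modeling: a summary stands in for a real statistical model."""
    counts = [r["overflows"] for r in rows]
    return {"mean": statistics.mean(counts), "max": max(counts)}


def present(summary):
    """Presentation: format the result for another person."""
    return f"Mean overflows per day: {summary['mean']:.1f} (max {summary['max']})"


def run_pipeline():
    """Run every stage in order so the whole analysis is one repeatable call."""
    return present(model(acquire()))


report = run_pipeline()
```

Because each stage is an ordinary function, you can rerun the whole analysis when the data changes and check any single stage in isolation when a result looks wrong.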
Tools
Properly discussing the relevant tools would take another post (maybe a book), but here’s one thought. While it always helps to have more education, you don’t need a PhD in math or computer science to create useful things. Loads of wonderful algorithms have already been implemented for you, and simple algorithms often work quite well. If you’re just getting started, focus on the “plumbing” that connects different datasets and systems together.
Data Science Mindset at Zipfian Academy
In the end, it’s not about the newest, trendiest framework or the fastest data analysis platform. It’s about finding interesting insights in your data and sharing them with the world. Start small, get your hands dirty, and have fun!