Google Track

Showing posts with label modelling. Show all posts

Tuesday, September 17, 2013

The Data Science Mindset

Intro 

Names like ‘R’, ‘SQL’, and ‘D3’ make data science seem more like alphabet soup than a deliberate practice of working with data. It’s so easy to get lost in the sea of acronyms, packages, and frameworks that we often find our students prematurely optimizing for the right toolset to use, unable to move forward until they have researched every available option. In reality, data science isn’t just about the tools. It’s a mindset: a way of looking at the world. It’s about taking advantage of our modern computers and all of the information that they’re already collecting to study how things work and push the limits of human knowledge just a little bit further. We have a favorite saying around here — data is everything and everything is data. If we begin with this mindset, a lot of data science approaches naturally follow.
 

Store Everything

Storage is cheap. Collect everything and ask questions later. Store it in the rawest form that is convenient, and don’t worry about how or even when you’re going to analyze it. That part comes later.

Use Existing Data

We’re already storing data — let’s use it. When faced with questions, data scientists regularly adapt the query so that it can be approximately answered with an existing and convenient dataset. The best part of data science is discovering surprising applications of existing stores of data. For example, there is a plethora of satellite imagery of Earth. We can use this data to learn about fertilizer use in Uganda, or use pictures of the Earth at night to estimate rural electrification in developing countries.

Connect Datasets

We’re storing everything, all over the world, inexpensively, for the first time in history. There are many lessons to be learned by utilizing more of this treasure trove. Don’t worry about making the best use out of a single source of data. Focus on connecting disparate datasets rather than tuning your models. Conventional statistics teaches a lot about how to choose analysis methods that are appropriate for your data collection approach and how to tune the models for a specific dataset.
Effective data science is about using a range of datasets, connecting the dots between one set of data and another, such as predicting restaurant health scores based on Yelp reviews. In machine learning speak: it’s often better to collect more features rather than spend days optimizing hyperparameters.
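As a rough illustration of "connecting the dots," here is a minimal sketch that joins two small datasets on a shared key. The restaurant IDs, column names, and values are invented for the example (they are not real Yelp or inspection data), and `inner_join` is a hypothetical helper, not a library function.

```python
import csv
import io

# Hypothetical inspection scores, keyed by a made-up restaurant id.
INSPECTIONS_CSV = """restaurant_id,health_score
r1,92
r2,71
r3,85
"""

# Hypothetical review summaries for an overlapping set of restaurants.
REVIEWS_CSV = """restaurant_id,avg_stars,mentions_dirty
r1,4.5,0
r2,2.5,7
r4,3.9,1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def inner_join(left, right, key):
    """Join two lists of dicts on a shared key, keeping only matching rows."""
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


joined = inner_join(read_rows(INSPECTIONS_CSV), read_rows(REVIEWS_CSV), "restaurant_id")
for row in joined:
    print(row)
```

Each joined row now carries features from both sources, which is the "more features" half of the trade-off described above. Restaurants that appear in only one dataset are dropped by this particular join.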

Anything Can Be Quantified

Our culture loves to quantify. If you can turn it into a number, that number can be put into a table. Importantly, that table can now be processed by a computer.
A spreadsheet about sewer overflows is clearly data to most people, but what about a calendar? At first, a calendar might not seem like the sort of data that you analyze with statistics. However, you can also represent a calendar as a spreadsheet and as a graph.
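To make the calendar idea concrete, here is a small sketch that turns a few calendar entries into both a table and a graph. The events, names, and field layout are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical calendar entries:
# (title, date, start hour, duration in hours, attendees)
events = [
    ("Standup", "2013-09-16", 9, 0.5, ["ana", "ben", "cy"]),
    ("Design review", "2013-09-16", 14, 1.0, ["ana", "cy"]),
    ("1:1", "2013-09-17", 11, 0.5, ["ben", "cy"]),
]

# Spreadsheet view: one row per event, one column per attribute.
table = [
    {"title": t, "date": d, "start_hour": h, "hours": dur, "n_attendees": len(people)}
    for t, d, h, dur, people in events
]

# Graph view: people are nodes, and an edge's weight counts shared meetings.
edges = defaultdict(int)
for _, _, _, _, people in events:
    for a, b in combinations(sorted(people), 2):
        edges[(a, b)] += 1

# Once the calendar is a table, ordinary aggregation applies.
hours_per_day = defaultdict(float)
for row in table:
    hours_per_day[row["date"]] += row["hours"]

print(dict(hours_per_day))
print(dict(edges))
```

The same entries support two different kinds of questions: the table answers "how much time do meetings take each day?", while the graph answers "who meets with whom most often?".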




Data science becomes a creative endeavor when you peel away the obvious variables presented to you. Maybe you have a bunch of PDF documents. You could easily extract the text in the PDFs and search through the content. But depending on the problem you are solving, these files may hold more interesting information than just the text: the page count, the file size, the shapes of the pages, and the program that created each file. There is information hidden in many datasets that goes beyond what’s immediately obvious.
There is a lot of talk about the difference between different kinds of data. There’s “qualitative” vs. “quantitative” and “unstructured” vs. “structured.” To me, there isn’t much difference between “qualitative” and “quantitative” data, nor is there between “unstructured” and “structured” data because I know that I can convert between the different types.
At first, the registration papers of a company might not seem like interesting data. They begin as paper, most of the fields are text, and the formats aren’t particularly standardized. But when you put them in a database in a machine-readable format, qualitative data becomes quantitative data that can be used to supplement other data sources.
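Here is one way that conversion might look, as a sketch only. The registration records, field names, and the chosen features are all hypothetical; real registration papers would need their own parsing rules.

```python
from datetime import date

# Hypothetical registration records after the paper forms have been typed in.
records = [
    {"name": "Acme Widgets Ltd", "incorporated": "2009-03-14",
     "directors": "J. Doe; A. Roe", "type": "Private Limited"},
    {"name": "Blue Sky Co-op", "incorporated": "1998-11-02",
     "directors": "M. Poe", "type": "Cooperative"},
]


def to_features(record, as_of=date(2013, 9, 17)):
    """Turn the free-text fields of one record into numbers."""
    year, month, day = (int(part) for part in record["incorporated"].split("-"))
    age_days = (as_of - date(year, month, day)).days
    directors = [d.strip() for d in record["directors"].split(";") if d.strip()]
    return {
        "name_length": len(record["name"]),
        "age_years": round(age_days / 365.25, 1),
        "n_directors": len(directors),
        "is_cooperative": int(record["type"].lower() == "cooperative"),
    }


features = [to_features(r) for r in records]
for row in features:
    print(row)
```

None of these numbers is interesting on its own, but each becomes a column that can be joined against other data sources.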

Send Boring Work to Robots

We no longer live in an era where “computer” refers to someone who carries out calculations. Find yourself doing something over and over? Give it to the bots. As far as data analysis goes, modern computers can be far more effective at rote tasks, such as drawing new graphs with every update of a dataset.
Data collection is a prime example of a task that should be automated. A common scene in university research labs is swaths of grad students handing out paper questionnaires to participants of studies. The data scientist says: collect the data automatically and unobtrusively, using existing systems whenever possible. The supercomputers we carry in our pocket are a great place to start.
This mindset can be applied not only to the data, but also to the process itself. Rather than learning and remembering your entire analysis process, you can write a program that does the whole thing for you, from the original acquisition of the data, to the modeling, to the presentation of results to another person. By making everything a program, you make it easier to find mistakes, to update your analyses, and to reproduce your results.
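A minimal sketch of "making everything a program" might look like the following: acquisition, modeling, and presentation each become a function, and one entry point runs them in order. The data is invented, the inline CSV stands in for a real download or database query, and the straight-line fit is just a placeholder model.

```python
import csv
import io

# Stand-in for a download or database query; the numbers are hypothetical.
RAW = "week,signups\n1,10\n2,14\n3,18\n4,22\n"


def acquire():
    """Step 1: load the raw data as (week, signups) pairs."""
    rows = csv.DictReader(io.StringIO(RAW))
    return [(float(r["week"]), float(r["signups"])) for r in rows]


def fit_line(points):
    """Step 2: ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def report(slope, intercept, next_week):
    """Step 3: present the result in a sentence a person can read."""
    forecast = slope * next_week + intercept
    return (f"Signups grow by about {slope:.1f} per week; "
            f"week {next_week} forecast: {forecast:.0f}")


def main():
    points = acquire()
    slope, intercept = fit_line(points)
    return report(slope, intercept, next_week=5)


summary = main()
print(summary)
```

When the data changes, rerunning the script regenerates the whole analysis, and a mistake in any step can be fixed in one place.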

Tools

Once inside the data science mindset, solving interesting problems becomes a function of data acquisition and processing. Computers can fit models and make predictions about datasets that are too big to wrap your head around, and they can convert paper documents into electronic tables. They probably know more about you and your habits than you know yourself! Use the tools available to you, but don’t get caught up on the tools themselves.
Properly discussing these relevant tools is another post (maybe a book), but here’s one thought. While it always helps to have more education, you don’t need a PhD in math or computer science in order to create useful things. Loads of wonderful algorithms have already been implemented for you, and simple algorithms often work quite well. If you’re just getting started, focus on the “plumbing” that connects different datasets and systems together.

Data Science Mindset at Zipfian Academy

Our course teaches many data science tools, but we also teach the data science mindset, because you need both to be a great data scientist. To this end, we organize our 12-week course by projects — such as a recommendation engine or spam filter — rather than software packages or algorithms. We teach the various tools in context of applied projects so students learn how to choose the appropriate tool and how to build the plumbing that connects them.
In the end, it’s not about the newest, trendiest framework or fastest data analysis platform. It’s about finding interesting insights in your data and sharing them with the world. Start small, get your hands dirty, and have fun!

Tuesday, May 15, 2012

Effective big data strategies detailed


Businesses beginning a big data analytics program with advanced business intelligence software may be concerned about affordability. However, according to PC Advisor, there are easy steps that companies can take to make sure that their deployments are successful. These steps include deep research into the business case at hand and prudent financial planning.



Careful planning


"[Big data is] new technology solving a business problem that we often haven't proved. That's important for CIOs to keep in mind," financial consultant Jeff Muscarella told PC Advisor. "The business is going to be coming to them with all sorts of half-baked ideas for what they can do with Big Data. They have to ask: Will it really drive revenue? How and for how long?"
According to the source, carefully vetting business ideas for new big data projects is vitally important for CIOs trying to save money on their big data projects. Gathering details on each projected usage of data means less chance of failure. The source urged companies to target their big data projects, to fire "bullets" rather than "cannons" at specific problems that can provide value for the company. Muscarella told the source that companies can start small to prove that a process works before moving to the company-wide infrastructure level.



Myths vs. reality


As a widely hyped technology often presented as the future of business intelligence, advanced analytics and big data have received a large amount of press. To avoid business confusion, several publications have offered clarifications of what the technology can and cannot offer companies. The Economic Times stated that the potential market for big data consists of any business with a product to sell and any company hoping to help. As companies begin to harness the power of big data, competitors could take advantage of the systems in a bid to compete on an even level.
The source sought to puncture myths about what big data can and cannot do. It stated that big data's endgame is unknown, and that many of the features of big data analytics are still spoken of in the future tense. The source found that companies can already use the technology to provide "amazing" customer insights from vast quantities of "irrelevant stuff." While it is important to be careful when integrating big data, making the effort could become a required part of business strategy.

Industry News from: http://www.panorama.com/industry-news/article-view.html?name=Effective-big-data-strategies-detailed-774397&utm_source=dlvr.it&utm_medium=facebook

Tuesday, May 8, 2012

Microsoft Predictive Analytics


Predictive analytics is the next step in BI: not only can you look back at what has happened in your company, but you can also distill new information from the old to predict what will happen in the future. Jamie MacLennan, CTO of Predixion Software, explains the difference between business intelligence and predictive analytics and shares a program that Predixion has created in Excel to review the Practice Fusion data.



Featuring Bruno Aziza

Tuesday, November 9, 2010

Information Management Concepts

Following the behavioral science theory of management, mainly developed at Carnegie Mellon University and prominently represented by Barnard, Richard M. Cyert, March and Simon, most of what goes on in service organizations is actually decision making and information processes. The crucial factor in the information and decision process analysis is thus individuals’ limited ability to process information and to make decisions under these limitations.

According to March and Simon [1], organizations have to be considered as cooperative systems with a high level of information processing and a vast need for decision making at various levels. They also claimed that there are factors that prevent individuals from acting strictly rationally, contrary to what was proposed and advocated by classical theorists.

Instead of using the model of the economic man, as advocated in classic theory, they proposed the administrative man as an alternative based on their argumentation about the cognitive limits of rationality.

While the theories developed at Carnegie Mellon clearly filled some theoretical gaps in the discipline, March and Simon [1] did not propose a particular organizational form that they considered especially suited to coping with the cognitive limitations and bounded rationality of decision-makers. Through their own argumentation against normative decision-making models, i.e., models that prescribe how people ought to choose, they also abandoned the idea of an ideal organizational form.

In addition to the factors mentioned by March and Simon, there are two other considerable aspects, stemming from environmental and organizational dynamics. Firstly, it is not possible to access, collect and evaluate all environmental information relevant to a given decision at a reasonable price, i.e., time and effort [2]. In other words, following a national economic framework, the transaction cost associated with the information process is too high. Secondly, established organizational rules and procedures can prevent the most appropriate decision from being taken, i.e., a sub-optimal solution is chosen in accordance with organizational rank structure or institutional rules, guidelines and procedures [3] [4], an issue that has also been brought forward as a major critique of the principles of bureaucratic organizations.[5]

According to the Carnegie Mellon School and its followers, information management, i.e., the organization's ability to process information, is at the core of organizational and managerial competencies. Consequently, strategies for organization design must aim at improved information processing capability. Jay Galbraith [6] has identified five main organization design strategies within two categories: increased information processing capacity and reduced need for information processing.

1. Reduction of information processing needs
   1. Environmental management
   2. Creation of slack resources
   3. Creation of self-contained tasks
2. Increasing the organizational information processing capacity
   1. Creation of lateral relations
   2. Vertical information systems

Environmental management. Instead of adapting to changing environmental circumstances, the organization can seek to modify its environment. Vertical and horizontal collaboration, i.e. cooperation or integration with other organizations in the industry value system are typical means of reducing uncertainty. An example of reducing uncertainty in relation to the prior or demanding stage of the industry system is the concept of Supplier-Retailer collaboration or Efficient Customer Response.

Creation of slack resources. In order to reduce exceptions, performance levels can be reduced, thus decreasing the information load on the hierarchy. These additional slack resources, required to reduce information processing in the hierarchy, represent an additional cost to the organization. The choice of this method clearly depends on the alternative costs of other strategies.

Creation of self-contained tasks. Achieving a conceptual closure of tasks is another way of reducing information processing. In this case, the task-performing unit has all the resources required to perform the task. This approach is concerned with task (de-)composition and interaction between different organizational units, i.e. organizational and information interfaces.

Creation of lateral relations. In this case, lateral decision processes are established that cut across functional organizational units. The aim is to apply a system of decision subsidiarity, i.e. to move decision power to the process, instead of moving information from the process into the hierarchy for decision-making.

Investment in vertical information systems. Instead of processing information through the existing hierarchical channels, the organization can establish vertical information systems. In this case, the information flow for a specific task (or set of tasks) is routed in accordance with the applied business logic, rather than the hierarchical organization.

Following the lateral relations concept, it also becomes possible to employ an organizational form that is different from the simple hierarchy. The matrix organization aims to bring together the functional and product departmental bases and to achieve a balance in information processing and decision making between the vertical (hierarchical) and the horizontal (product or project) structure. The creation of a matrix organization can also be considered management's response to a persistent or permanent demand for adaptation to environmental dynamics, rather than a response to episodic demands.

Source: Wikipedia