Background
The "wrangler" non-technical term is often said to derive from work done by the United States Library of Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP) and their program partner the Emory University Libraries based MetaArchive Partnership. The term "mung" has roots in munging as described in theConnection to data mining
Data wrangling is a superset of data mining and requires processes that some data mining uses, but not always. The process of data mining is to find patterns within large data sets, where data wrangling transforms data in order to deliver insights about that data. Even though data wrangling is a superset of data mining does not mean that data mining does not use it, there are many use cases for data wrangling in data mining. Data wrangling can benefit data mining by removing data that does not benefit the overall set, or is not formatted properly, which will yield better results for the overall data mining process. An example of data mining that is closely related to data wrangling is ignoring data from a set that is not connected to the goal: say there is a data set related to the state of Texas and the goal is to get statistics on the residents of Houston, the data in the set related to the residents of Dallas is not useful to the overall set and can be removed before processing to improve the efficiency of the data mining process.Benefits
With an increase of raw data comes an increase in the amount of data that is not inherently useful, this increases time spent on cleaning and organizing data before it can be analyzed which is where data wrangling comes into play. The result of data wrangling can provide important metadata statistics for further insights about the data, it is important to ensure metadata is consistent otherwise it can cause roadblocks. Data wrangling allows analysts to analyze more complex data more quickly, achieve more accurate results, and because of this better decisions can be made. Many businesses have moved to data wrangling because of the success that it has brought.Core ideas
Typical use
The data transformations are typically applied to distinct entities (e.g. fields, rows, columns, data values, etc.) within a data set, and could include such actions as extractions, parsing, joining, standardizing, augmenting, cleansing, consolidating, and filtering to create desired wrangling outputs that can be leveraged downstream. The recipients could be individuals, such as data architects or data scientists who will investigate the data further, business users who will consume the data directly in reports, or systems that will further process the data and write it into targets such as data warehouses, data lakes, or downstream applications.Modus operandi
Depending on the amount and format of the incoming data, data wrangling has traditionally been performed manually (e.g. via spreadsheets such as Excel), tools like KNIME or via scripts in languages such as Python or SQL. R, a language often used in data mining and statistical data analysis, is now also sometimes used for data wrangling. Data wranglers typically have skills sets within: R or Python, SQL, PHP, Scala, and more languages typically used for analyzing data. Visual data wrangling systems were developed to make data wrangling accessible for non-programmers, and simpler for programmers. Some of these also include embedded AI recommenders and programming by example facilities to provide user assistance, and program synthesis techniques to autogenerate scalable dataflow code. Early prototypes of visual data wrangling tools include OpenRefine and the Stanford/BerkeleExample
Given a set of data that contains information on medical patients your goal is to find correlation for a disease. Before you can start iterating through the data ensure that you have an understanding of the result, are you looking for patients who have the disease? Are there other diseases that can be the cause? Once an understanding of the outcome is achieved then the data wrangling process can begin. Start by determining the structure of the outcome, what is important to understand the disease diagnosis. Once a final structure is determined, clean the data by removing any data points that are not helpful or are malformed, this could include patients that have not been diagnosed with any disease. After cleaning look at the data again, is there anything that can be added to the data set that is already known that would benefit it? An example could be most common diseases in the area, America and India are very different when it comes to most common diseases. Now comes the validation step, determine validation rules for which data points need to be checked for validity, this could include date of birth or checking for specific diseases. After the validation step the data should now be organized and prepared for either deployment or evaluation. This process can be beneficial for determining correlations for disease diagnosis as it will reduce the vast amount of data into something that can be easily analyzed for an accurate result.See also
* Alteryx * Data preparation * OpenRefine * TrifactaReferences
External links
* {{Data Computer occupations Data mapping