
Hancock is a C-based programming language, first developed by researchers at AT&T Labs in 1998, to analyze data streams. The language was intended by its creators to improve the efficiency and scale of data mining. Hancock works by creating profiles of individuals, using data to provide behavioral and social network information. The development of Hancock was part of the telecommunications industry's use of data mining processes to detect fraud and to improve marketing. However, following the September 11, 2001 attacks and the subsequent increase in government surveillance of individuals, Hancock and similar data mining technologies came under public scrutiny, especially regarding their perceived threat to individual privacy.


Background

Data mining research, including Hancock, grew during the 1990s as scientific, business, and medical interest in massive data collection, storage, and management increased. During the early 1990s, transactional businesses became increasingly interested in data warehousing, which provided storage, query, and management capabilities for the entirety of recorded transactional data. Database-oriented data mining research focused on creating efficient data structures and algorithms, particularly for data located outside main memory, on a disk, for example. Padhraic Smyth believed that data mining researchers aimed to write algorithms that could scale to massive amounts of data while running in shorter amounts of time.

Researchers at AT&T Labs, including Corinna Cortes, pioneered the Hancock programming language from 1998 to 2004. Hancock, a C-based domain-specific programming language, was intended to make program code for computing signatures from large transactional data streams easier to read and maintain, thus serving as an improvement over the complex data mining programs written in C. Hancock also managed issues of scale for data mining programs. Hancock programs were intended to handle data streams yielding hundreds of millions of signatures daily, which suited them to transactions like telephone calls, credit card purchases, or website requests. At the time Hancock was developed, these data were usually amassed for billing or security purposes and, increasingly, to analyze how transactors behaved.

Data mining can also be useful for identifying atypical patterns in transactor data. With regard to anti-terrorist activities, data mining's assistance in pattern-finding can help find links between terrorist suspects, through funding or arms transfers, for example. Data stream applications also include network monitoring, financial monitoring (such as security derivative pricing), prescription drug effect monitoring, and e-commerce. Data mining can be used by firms to find their most profitable consumers or to conduct churn analysis. Data mining can also help firms make credit-lending decisions by designing models that determine a customer's creditworthiness. These models are intended to minimize risky credit-lending while maximizing sales revenue. Besides Hancock, other data stream systems in existence by 2003 included Aurora, Gigascope, Niagara, STREAM, Tangram, Tapestry, Telegraph, and Tribeca.


Processes


Databases

Hancock is a language for data stream mining programs. Data streams differ from traditional stored databases in that they experience very high volumes of data and allow analysts to act upon such data in near-real time. Stored databases, on the other hand, hold data that has been loaded for offline querying. Data warehouses, which store integrated data from different systems, can be costly to build and lengthy to implement. Even simplified data warehouses can take months to build.

The scale of massive data stream mining poses problems to data miners. For example, internet and telephone network data mining might be tasked with finding persistent items, which are items that regularly occur in the stream. However, these items might be buried in a large amount of the network's transactional data; while the items can eventually be found, data miners aim to find them in less time.

In database technology, users do not necessarily know where the data they are searching for is located. These users only have to issue queries for data, which the database management system returns. In a large data set, data can be held in random-access memory (RAM), which is primary storage, or on disk, which is secondary storage. In 2000, Padhraic Smyth estimated that, using the most recent technology, data located in RAM could be accessed relatively quickly, "on the order of 10⁻⁷ to 10⁻⁸ seconds," while data in secondary storage took significantly longer to access, "on the order of 10⁴ to 10⁵" times slower.
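
The notion of a persistent item can be made concrete with a small sketch. The following Python function is only an illustration and is not taken from Hancock: the function name, the fixed-width time windows, and the `min_fraction` threshold are all assumptions chosen for the example. It treats an item as persistent when it appears in at least a given fraction of the time windows covered by the stream. It also keeps every item in memory, which a system operating at the scale described above would likely avoid.

```python
from collections import defaultdict


def persistent_items(stream, window_size, min_fraction):
    """Return the items seen in at least min_fraction of the time windows.

    stream: iterable of (timestamp, item) pairs with non-negative timestamps.
    window_size: width of one window, in the same unit as the timestamps.
    min_fraction: hypothetical threshold between 0 and 1.
    """
    windows_seen = defaultdict(set)  # item -> indices of windows it appeared in
    last_window = -1
    for timestamp, item in stream:
        window = int(timestamp // window_size)
        windows_seen[item].add(window)
        last_window = max(last_window, window)

    total_windows = last_window + 1
    if total_windows == 0:
        return set()
    return {
        item
        for item, seen in windows_seen.items()
        if len(seen) / total_windows >= min_fraction
    }


# Illustrative records: "a" occurs in every window, "b" in two of three, "c" in one.
example_stream = [(0, "a"), (1, "b"), (10, "a"), (11, "c"),
                  (20, "a"), (21, "b"), (22, "b")]
regulars = persistent_items(example_stream, window_size=10, min_fraction=0.6)
```

Counting the windows in which an item appears, rather than its raw frequency, separates items that recur regularly from items that arrive in a single burst.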


Data mining

Data mining can be broken down into the processes of input, analysis, and the reporting of results; it uses algorithms to find patterns and relationships among the subjects and has been used by commercial companies to find patterns in client behavior. Data analysts are needed to collect and organize data and train algorithms. KianSing Ng and Huan Liu opine that even with straightforward data mining goals, the actual process is still complex. For example, they argue that real-world data mining can be challenged by data fluctuations, which would render prior patterns "partially invalid." Another complication is that most databases in existence in 2000 were characterized by high dimensionality, which means that they contain data on many attributes. As Ng and Liu note, high dimensionality produces long computing times; this can be mitigated by data reduction in the pre-processing stage.

Hancock's process is as follows:
* Hancock programs analyzed data in real time, as it arrived in data warehouses.
* Hancock programs computed the signatures, or behavioral profiles, of transactors in the stream.
** Data stream transactors include telephone numbers or IP addresses.
** Signatures enable analysts to discover patterns hidden in the data.
* Telecommunications data streams consist of call records, which include information on the locations of callers and the time of calls, and sometimes include recordings of conversations.
** Hancock was used to process signatures based on data like the length of phone calls and the number of calls to a particular area over a specified interval of time.
* Hancock programs used link analysis to find "communities of interest," which connected signatures based on similarities in behavior. Link analysis requires that linkages between data are continually updated, and it is used to detect fraud networks.
** Link analysis, which can be considered a form of association data mining, aims to find connections between relationships. One such relationship is call patterns in telecommunications.

Association data mining aims to find relationships between variables. For example, one research paper suggested that a market could use association analysis to find the probability that a customer who purchases coffee also purchases bread; the market could then use that information to influence store layout and promotions. Because Hancock code performed efficiently, even with large amounts of data, the AT&T researchers claimed that it allowed analysts to create applications "previously thought to be infeasible."
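
The signature step listed above can be sketched in a few lines. The example below is written in Python rather than Hancock, and it does not reproduce Hancock's syntax or AT&T's actual signature definitions: the `Signature` class, its fields, and the record layout `(caller, area_code, seconds)` are hypothetical. It only shows the general idea of updating one small profile per transactor in a single pass over the stream, using the two kinds of measurements mentioned in the text (call length and calls to a particular area).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Signature:
    """Hypothetical per-caller profile built from call records."""
    call_count: int = 0
    total_seconds: int = 0
    calls_by_area: dict = field(default_factory=lambda: defaultdict(int))

    @property
    def mean_seconds(self):
        # Average call length; 0.0 for a caller with no calls yet.
        return self.total_seconds / self.call_count if self.call_count else 0.0


def update_signatures(signatures, call_records):
    """Fold a batch of (caller, area_code, seconds) records into the profiles."""
    for caller, area_code, seconds in call_records:
        sig = signatures[caller]
        sig.call_count += 1
        sig.total_seconds += seconds
        sig.calls_by_area[area_code] += 1
    return signatures


# Made-up records for illustration only.
signatures = defaultdict(Signature)
update_signatures(signatures, [
    ("555-0100", "212", 60),
    ("555-0100", "212", 120),
    ("555-0199", "415", 30),
])
```

Because each record touches only the profile of its own caller, the stream never has to be held in memory as a whole; only the signatures persist between batches.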


Applications

The AT&T Labs researchers analyzed telecommunications data streams, including the company's entire long-distance stream, which included around 300 million records from 100 million customer accounts daily. By 2004, all of AT&T's long-distance phone call record signatures were written in Hancock, and the company used Hancock code to sift through nine gigabytes of network traffic nightly. Telecommunications companies share information derived from data mining network traffic for research, security, and regulatory purposes.


Marketing

Hancock programs assisted in AT&T's marketing efforts. In the 1990s, large data stream mining and the increased automation of government public record systems allowed commercial corporations in the United States to personalize marketing. Signature profiles were developed from both transaction records and public record sources. Ng and Liu, for example, applied data mining to customer retention analysis, and found that mining of association rules allowed a firm to predict departures of influential customers and their associates. They argued that such knowledge subsequently empowers the company's marketing team to target those customers, offering more attractive pitches.

Data mining assisted telecommunications companies in viral marketing, also known as buzz marketing or word-of-mouth marketing, which uses consumer social networks to improve brand awareness and profit. Viral marketing depends on connections between consumers to increase brand advocacy, which can either be explicit, such as friends recommending a product to other friends, or implicit, such as influential consumers purchasing a product. For firms, one of the goals of viral marketing is to find influential consumers who have larger networks. Another method of viral marketing is to target the neighbors of prior consumers, known as "network targeting." Using Hancock programs, analysts at AT&T were able to find "communities of interest," or interconnected users who featured similar behavioral traits.

One of the issues viral marketing promoters encountered was the large size of marketing data sets, which, in the case of telecommunication companies, can include information on transactors and their descriptive attributes and transactions. Marketing data sets, when they run to hundreds of millions of records, can exceed the memory capacity of statistical analysis software. Hancock programs addressed data scaling issues and allowed analysts to make decisions as the data flowed into the data warehouses.

While the development of wireless communication devices allowed law enforcement to track the location of users, it also allowed companies to improve consumer marketing, such as by sending messages according to wireless users' proximity to particular businesses. Through cell site location data, Hancock programs were capable of tracking wireless users' movements. According to academic Alan Westin, the increase of telemarketing during this period also increased consumer annoyance. Statisticians Murray Mackinnon and Ned Glick hypothesized in 1999 that firms hid their use of commercial data mining because of potential consumer backlash for mining customer records. As an example, Mackinnon and Glick cited a June 1999 lawsuit in which the state of Minnesota sued US Bancorp for releasing customer information to a telemarketing firm; Bancorp promptly responded to the lawsuit by restricting its usage of customer data.
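
The "network targeting" method mentioned earlier in this section, which targets the neighbors of prior consumers, reduces to a simple graph lookup. The Python sketch below is an illustration under assumed inputs, not a description of AT&T's marketing systems: the function name and the representation of the network as undirected `(person, person)` pairs are hypothetical.

```python
def network_targets(edges, prior_customers):
    """Return people linked to a prior customer who are not customers themselves.

    edges: iterable of (person_a, person_b) pairs, treated as undirected links.
    prior_customers: set of people who have already bought the product.
    """
    targets = set()
    for a, b in edges:
        if a in prior_customers and b not in prior_customers:
            targets.add(b)
        if b in prior_customers and a not in prior_customers:
            targets.add(a)
    return targets


# Made-up network: only "ann" and "cat" are linked to the prior customer "bob".
example_edges = [("ann", "bob"), ("bob", "cat"), ("dan", "eve")]
targets = network_targets(example_edges, prior_customers={"bob"})
```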


Fraud detection

AT&T researchers, including Cortes, showed that Hancock-related data mining programs could be used for finding telecommunications fraud. Telecommunications fraud includes subscription fraud, unauthorized calling card usage, and PBX fraud. Detecting it resembles detecting mobile communications fraud and credit card fraud: in all three, firms must process large amounts of data in order to obtain information; they must deal with the unpredictability of human behavior, which makes finding patterns in the data difficult; and their algorithms must be trained to spot the relatively rare cases of fraud among the many legitimate transactions. According to Daskalaki ''et al.'', in 1998, telecommunications fraud incurred billions of dollars in annual losses globally. Because fraud cases were relatively few compared to the hundreds of millions of daily telephone transactions that occurred, algorithms for data mining of telecommunication records need to provide results quickly and efficiently.

The researchers showed that communities of interest could identify fraudsters, since data nodes from fraudulent accounts are typically located closer to each other than to a node from a legitimate account. Through social network analysis and link analysis, they also found that the numbers targeted by fraudulent accounts that were later disconnected were often called again by fraudsters using different numbers; such connections could be used to identify fraudulent accounts. Link analysis methods are based on the assumption that fraudsters rarely deviate from their calling habits.
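
The observation that new fraudulent accounts tend to call the same numbers as previously disconnected ones suggests a simple overlap score. The Python sketch below is one possible reading of that idea, not the researchers' published method: the function name, the `(caller, callee)` record layout, and the choice of score (the share of an account's called numbers that known fraudulent accounts also called) are assumptions made for the example.

```python
from collections import defaultdict


def fraud_overlap_scores(call_records, known_fraud_accounts):
    """Score each remaining account by its overlap with numbers fraudsters called.

    call_records: iterable of (caller, callee) pairs.
    known_fraud_accounts: set of callers already identified as fraudulent.
    Returns {account: fraction of its called numbers also called by fraudsters}.
    """
    call_records = list(call_records)  # allow two passes over the records

    # First pass: numbers called by accounts already known to be fraudulent.
    fraud_targets = {
        callee for caller, callee in call_records
        if caller in known_fraud_accounts
    }

    # Second pass: the set of numbers each remaining account has called.
    called = defaultdict(set)
    for caller, callee in call_records:
        if caller not in known_fraud_accounts:
            called[caller].add(callee)

    return {
        account: len(numbers & fraud_targets) / len(numbers)
        for account, numbers in called.items()
    }


# Made-up records: account "A" shares two of its three callees with "F1".
example_calls = [("F1", "X"), ("F1", "Y"),
                 ("A", "X"), ("A", "Y"), ("A", "Z"),
                 ("B", "Q")]
scores = fraud_overlap_scores(example_calls, known_fraud_accounts={"F1"})
```

A high score is only a signal for further review; legitimate accounts can share callees with fraudulent ones, so any threshold would have to be chosen against real data.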


Relation to surveillance

In 2007, ''Wired'' magazine published an online article claiming that Hancock was created by AT&T researchers for "surveillance purposes." The article highlighted research papers written by Cortes ''et al.'', particularly the researchers' concept of "communities of interest." The article connected Hancock's concept with the then-recent public findings that the Federal Bureau of Investigation (FBI) had been making warrantless requests for records of "communities of interest" from telecommunication companies under the USA PATRIOT Act. The article claimed that AT&T "invented the concept and the technology" of creating "community of interest" records, citing the company's ownership of related data mining patents. Finally, the article noted that AT&T, along with Verizon, was, at the time, being sued in federal court for providing the National Security Agency (NSA) with access to billions of telephone records belonging to Americans. The NSA, the article claimed, obtained such data with the intention of data mining it to find suspected terrorists and warrantless wiretapping targets.


FBI telecommunication records surveillance

Federal telecommunications surveillance is not a recent historical development in the United States. According to academic Colin Agur, telephone surveillance by law enforcement in the United States became more common in the 1920s. In particular, telephone wiretapping became a prevalent form of evidence collection by law enforcement officials, especially federal agents, during Prohibition. Agur argues that the Communications Act of 1934, which established the Federal Communications Commission (FCC), reined in law enforcement abuse of telephone surveillance. Under the act, telecommunications companies could keep records of illegal telecommunications interception requests and report them to the FCC. After the Federal Wiretap Act of 1968 and the Supreme Court's decision in ''Katz v. United States'', both of which extended Fourth Amendment protections to telephone communications, federal telecommunications surveillance required warrants.

The FBI was first authorized to obtain national security letters (NSLs) for communication billing records, including those from telephone services, after Congress passed the Electronic Communications Privacy Act of 1986. The letters forced telephone companies to provide the FBI with customer information, such as names, addresses, and long-distance call records. Congress would eventually expand NSL authority to include requests for local call records as well.

After the September 11, 2001 attacks, Congress passed the USA PATRIOT Act, which made it easier for investigators at the FBI to be issued NSLs for terrorism investigations. Academics William Bendix and Paul Quirk contend that the PATRIOT Act allowed the FBI to access and collect the private data of many citizens without the approval of a judge. The FBI was allowed to keep a collection of records, with no time limit for possession. It could also force NSL recipients to remain silent through the use of gag orders. The ''Wired'' article claimed that the FBI began making warrantless requests to telecommunication companies for "communities of interest" records of suspects under the USA PATRIOT Act. The article claimed that law enforcement discovered the existence of such records based on research by Hancock's creators.

In 2005, government leaks revealed the FBI's abuse of NSLs. In 2006, when the PATRIOT Act was renewed, it included provisions that required the Justice Department's inspector general to annually review NSL usage. The first inspector general report found that 140,000 NSL requests, on nearly 24,000 U.S. persons, were granted to FBI agents from 2003 to 2005. The data was then added to databanks available to thousands of agents.


NSA telecommunication records surveillance

The public-private relationship of telecommunication companies extends into the homeland security domain. Telecommunication companies, including AT&T, Verizon, and BellSouth, cooperated with NSA requests for access to transactional records. Telecommunications companies, including AT&T, have maintained partnerships with government agencies, like the Department of Homeland Security, to collaborate on sharing information and solving national cybersecurity issues. AT&T representatives sit on the board of the National Cyber Security Alliance (NCSA), which promotes cybersecurity awareness and computer user protection.

Analysts at the NSA, under authority of the secret Terrorist Surveillance Program, also used data mining to find terrorist suspects and sympathizers. In this search, the NSA intercepted communications, including telephone calls, leaving and entering the United States. Agents screened the information for possible links to terrorism, such as the desire to learn to fly planes or specific locations of the communication's recipients, like Pakistan. In 2005, ''The New York Times'' reported on the existence of the program, which the Bush administration defended as necessary in its counterterrorism efforts and limited to terrorist suspects and associates.

However, in 2007, the ''Wired'' article noted that AT&T and Verizon were being sued in federal court for providing the NSA with access to billions of telephone records belonging to Americans for anti-terrorism activities, such as using data mining to locate suspected terrorists and warrantless wiretapping targets. In 2013, following the Snowden leaks, it was revealed that the program had mined the communications not only of terrorist suspects but also of millions of American citizens. A 2014 independent audit by the Privacy and Civil Liberties Oversight Board found that the program had limited counterterrorism benefits.


See also

* Data mining
* C (programming language)
* Social network analysis
* Viral marketing
* Fraud detection
* September 11 attacks
* USA PATRIOT Act
* National security letter (NSL)

