
What Is Data Science? Translation and Close Reading of “What is data science” by Mike Loukides, Part 03

2023-07-22 23:13 Author: 跳舞的Jennifer

Working with data at scale

We’ve all heard a lot about “big data,” but “big” is really a red herring. Oil companies, telecommunications companies, and other data-centric industries have had huge datasets for a long time. And as storage capacity continues to expand, today’s “big” is certainly tomorrow’s “medium” and next week’s “small.” The most meaningful definition I’ve heard: “big data” is when the size of the data itself becomes part of the problem. We’re discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.

Note: in English, a “red herring” is an irrelevant fact or argument introduced to divert attention from the real issue.

What are we trying to do with data that’s different? According to Jeff Hammerbacher [2] (@hackingdata), we’re trying to build information platforms or dataspaces. Information platforms are similar to traditional data warehouses, but different. They expose rich APIs, and are designed for exploring and understanding the data rather than for traditional analysis and reporting. They accept all data formats, including the most messy, and their schemas evolve as the understanding of the data changes.

2. “Information Platforms as Dataspaces,” by Jeff Hammerbacher (in Beautiful Data)

Most of the organizations that have built data platforms have found it necessary to go beyond the relational database model. Traditional relational database systems stop being effective at this scale. Managing sharding and replication across a horde of database servers is difficult and slow. The need to define a schema in advance conflicts with the reality of multiple, unstructured data sources, in which you may not know what’s important until after you’ve analyzed the data. Relational databases are designed for consistency, to support complex transactions that can easily be rolled back if any one of a complex set of operations fails. While rock-solid consistency is crucial to many applications, it’s not really necessary for the kind of analysis we’re discussing here. Do you really care if you have 1,010 or 1,012 Twitter followers? Precision has an allure, but in most data-driven applications outside of finance, that allure is deceptive. Most data analysis is comparative: if you’re asking whether sales to Northern Europe are increasing faster than sales to Southern Europe, you aren’t concerned about the difference between 5.92 percent annual growth and 5.93 percent.

Note: in the sentence “The need to define a schema in advance conflicts with reality of multiple, unstructured data sources, in which you may not know what’s important until after you’ve analyzed the data,” the subject is “the need,” modified by “to define a schema in advance”; the verb is “conflicts with”; and the object is “reality,” specifically the “reality of multiple, unstructured data sources.” “In which” introduces a relative clause whose antecedent is “data sources.”

Rolling back a transaction combines consistency with atomicity: a transaction either executes completely or is rolled back to the state in which none of it has executed.
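The all-or-nothing behavior described in this note can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `accounts` table, the account names, and the `transfer` helper are hypothetical and do not come from the article.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one atomic transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back
        # if an exception escapes from it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # One step of the multi-step operation "fails".
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are kept
failed = transfer(conn, "alice", "bob", 500)  # both updates are undone
print(ok, failed, balances(conn))
```

The second transfer changes two rows before it fails, yet neither change survives the rollback, which is the guarantee the paragraph above attributes to relational databases.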

To store huge datasets effectively, we’ve seen a new breed of databases appear. These are frequently called NoSQL databases, or Non-Relational databases, though neither term is very useful. They group together fundamentally dissimilar products by telling you what they aren’t. Many of these databases are the logical descendants of Google’s BigTable and Amazon’s Dynamo, and are designed to be distributed across many nodes, to provide “eventual consistency” but not absolute consistency, and to have very flexible schemas. While there are two dozen or so products available (almost all of them open source), a few leaders have established themselves:

Note: besides “to set up,” “establish” can also mean “to become firmly settled,” which is the sense used here.

• Cassandra: Developed at Facebook, in production use at Twitter, Rackspace, Reddit, and other large sites. Cassandra is designed for high performance, reliability, and automatic replication. It has a very flexible data model. A new startup, Riptano, provides commercial support.

• HBase: Part of the Apache Hadoop project, and modelled on Google’s BigTable. Suitable for extremely large databases (billions of rows, millions of columns), distributed across thousands of nodes. Along with Hadoop, commercial support is provided by Cloudera.
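The “eventual consistency” these databases provide can be illustrated with a toy model. The sketch below assumes a last-write-wins rule with logical timestamps, which is only one of several possible reconciliation strategies; the `Replica` class and `sync` function are hypothetical and are not the API of Cassandra, HBase, or Dynamo.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we hold.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions; the newest timestamp wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.store.items()):
            dst.write(key, value, ts)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("followers", 1010, timestamp=1)
r2.write("followers", 1012, timestamp=2)

before = (r1.read("followers"), r2.read("followers"))  # replicas disagree
sync(r1, r2)
after = (r1.read("followers"), r2.read("followers"))   # replicas converge
print(before, after)
```

Between the writes and the sync, a reader may see 1,010 or 1,012 depending on which replica answers. That is the temporary disagreement the article argues is acceptable for this kind of analysis.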

Storing data is only part of building a data platform, though. Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the “map” stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single reduce task. In hindsight, MapReduce seems like an obvious solution to Google’s biggest problem, creating large searches. It’s easy to distribute a search across thousands of processors, and then combine the results into a single set of answers. What’s less obvious is that MapReduce has proven to be widely applicable to many large data problems, ranging from search to machine learning.

Note: the quoted “map” means mapping. The “reduce” in “reduce task” is the Reduce of MapReduce: the passage first describes the map stage and then introduces the reduce task.
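The map and reduce stages described above can be sketched in a few lines of single-process Python. This is a minimal illustration of the idea using the classic word-count task, not Google's or Hadoop's implementation; the function names are hypothetical, and a real cluster would run the map calls on many machines in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn one input document into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the intermediate values for one key."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent, so on a cluster these calls
    # could be distributed across many processors.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data science is data"])
print(counts)
```

Because the subtasks are identical and share no state, adding processors speeds up the map stage without changing the code, which is what makes the approach attractive for very large inputs.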

The most popular open source implementation of MapReduce is the Hadoop project. Yahoo’s claim that they had built the world’s largest production Hadoop application, with 10,000 cores running Linux, brought it onto center stage. Many of the key Hadoop developers have found a home at Cloudera, which provides commercial support. Amazon’s Elastic MapReduce makes it much easier to put Hadoop to work without investing in racks of Linux machines, by providing preconfigured Hadoop images for its EC2 clusters. You can allocate and de-allocate processors as needed, paying only for the time you use them.

Note: this paragraph describes the kind of cloud service that providers such as Alibaba Cloud offer today.

Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it’s the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it.

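Note: in the first sentence, “of which there are several” refers to MapReduce implementations: several exist, and Hadoop is more than just one of them.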

Hadoop has been instrumental in enabling “agile” data analysis. In software development, “agile practices” are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) makes it easy to build clusters that can perform computations on large datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It’s easier to consult with clients to figure out whether you’re asking the right questions, and it’s possible to pursue intriguing possibilities that you’d otherwise have to drop for lack of time.

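Note: turn-around time is the time between starting a computation and getting its result. It matters for iteration: in software development, every process model except the waterfall model repeats one or more phases, so a shorter turn-around time means faster iterations.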

Hadoop is essentially a batch system, but Hadoop Online Prototype (HOP) is an experimental project that enables stream processing. HOP processes data as it arrives, and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter. These features only require soft real-time; reports on trending topics don’t require millisecond accuracy. As with the number of followers on Twitter, a “trending topics” report only needs to be current to within five minutes -- or even an hour. According to Hilary Mason (@hmason), data scientist at bit.ly, it’s possible to precompute much of the calculation, then use one of the experiments in real-time MapReduce to get presentable results.
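The soft real-time requirement for trending topics can be sketched as a sliding-window counter. This is an illustrative assumption about how such a report might be kept current to within five minutes; it is not how Twitter, HOP, or bit.ly implement it. The `TrendingTopics` class is hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import Counter, deque


class TrendingTopics:
    """Count topic mentions seen within the last `window_seconds`."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.events = deque()    # (timestamp, topic), oldest first
        self.counts = Counter()  # topic -> mentions inside the window

    def add(self, topic, timestamp):
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n, now):
        self._expire(now)
        return self.counts.most_common(n)


tracker = TrendingTopics(window_seconds=300)
stream = [(0, "#hadoop"), (10, "#rstats"), (20, "#hadoop"),
          (200, "#rstats"), (310, "#rstats")]
for timestamp, topic in stream:
    tracker.add(topic, timestamp)

trending = tracker.top(2, now=310)
print(trending)
```

The counter is updated incrementally as each event arrives, so producing the report never requires rescanning the full history.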

Machine learning is another essential tool for the data scientist. We now expect web and mobile applications to incorporate recommendation engines, and building a recommendation engine is a quintessential artificial intelligence problem. You don’t have to look at many modern web applications to see classification, error detection, image matching (behind Google Goggles and SnapTell) and even face detection -- an ill-advised mobile application lets you take someone’s picture with a cell phone, and look up that person’s identity using photos available online. Andrew Ng’s Machine Learning course is one of the most popular courses in computer science at Stanford, with hundreds of students (this video is highly recommended).

There are many libraries available for machine learning: PyBrain in Python, Elefant, Weka in Java, and Mahout (coupled to Hadoop). Google has just announced their Prediction API, which exposes their machine learning algorithms for public use via a RESTful interface. For computer vision, the OpenCV library is a de-facto standard.

Mechanical Turk is also an important part of the toolbox. Machine learning almost always requires a “training set,” or a significant body of known data with which to develop and tune the application. The Turk is an excellent way to develop training sets. Once you’ve collected your training data (perhaps a large collection of public photos from Twitter), you can have humans classify them inexpensively -- possibly sorting them into categories, possibly drawing circles around faces, cars, or whatever interests you. It’s an excellent way to classify a few thousand data points at a cost of a few cents each. Even a relatively large job only costs a few hundred dollars.

Note: “possibly drawing circles around faces, cars, or whatever interests you” describes what data labeling now calls drawing bounding boxes.
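Human-provided labels like these are often noisy, so one common way to turn them into a training set is to ask several workers per item and keep the majority answer. The article does not describe this step; the sketch below is an assumption about how it might be done, and the `aggregate_labels` function and sample votes are hypothetical.

```python
from collections import Counter


def aggregate_labels(votes):
    """Pick the majority label per item and report the agreement rate.

    `votes` maps an item id to the list of labels given by different
    workers. Ties go to the label that was encountered first.
    """
    result = {}
    for item, labels in votes.items():
        label, count = Counter(labels).most_common(1)[0]
        result[item] = (label, count / len(labels))
    return result


votes = {
    "photo_1": ["car", "car", "face"],
    "photo_2": ["face", "face", "face"],
}
training_set = aggregate_labels(votes)
print(training_set)
```

The agreement rate gives a rough signal for which items may need another round of labeling before they are used for training.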

While I haven’t stressed traditional statistics, building statistical models plays an important role in any data analysis. According to Mike Driscoll (@dataspora), statistics is the “grammar of data science.” It is crucial to “making data speak coherently.” We’ve all heard the joke that eating pickles causes death, because everyone who dies has eaten pickles. That joke doesn’t work if you understand what correlation means. More to the point, it’s easy to notice that one advertisement for R in a Nutshell generated 2 percent more conversions than another. But it takes statistics to know whether this difference is significant, or just a random fluctuation. Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid. Statistics plays a role in everything from traditional business intelligence (BI) to understanding how Google’s ad auctions work. Statistics has become a basic skill. It isn’t superseded by newer techniques from machine learning and other disciplines; it complements them.
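Whether a 2 percent lift in conversions is significant or just random fluctuation can be checked with a two-proportion z-test. The sketch below uses only the standard library. The conversion counts are made-up numbers for illustration, not data from the article, and the pooled z-test is one reasonable choice of test among several.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference of two rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both ads perform equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical data: ad A converts 2 percent more often than ad B
# (5.1 percent versus 5.0 percent) at two different sample sizes.
z_small, p_small = two_proportion_z_test(51, 1_000, 50, 1_000)
z_large, p_large = two_proportion_z_test(51_000, 1_000_000, 50_000, 1_000_000)

print(f"1,000 views each:     z={z_small:.2f}, p={p_small:.3f}")
print(f"1,000,000 views each: z={z_large:.2f}, p={p_large:.4f}")
```

With a thousand views per ad the same relative lift is indistinguishable from noise, while with a million views it is very unlikely to be chance. The observed difference alone does not say which situation you are in.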

While there are many commercial statistical packages, the open source R language -- and its comprehensive package library, CRAN -- is an essential tool. Although R is an odd and quirky language, particularly to someone with a background in computer science, it comes close to providing “one stop shopping” for most statistical work. It has excellent graphics facilities; CRAN includes parsers for many kinds of data; and newer extensions extend R into distributed computing. If there’s a single tool that provides an end-to-end solution for statistics work, R is it.

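Note: for students with a computer science or software background, languages such as C are deeply ingrained, and R’s internal logic is very different from that of those general-purpose programming languages.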

Tags: data analysis, data mining, data science
