What is data science? “What is data science” by Mike Loukides: translation and close reading, part 04
Making data tell its story
A picture may or may not be worth a thousand words, but a picture is certainly worth a thousand numbers. The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph. Edward Tufte’s Visual Display of Quantitative Information is the classic for data visualization, and a foundational text for anyone practicing data science. But that’s not really what concerns us here. Visualization is crucial to each stage of the data scientist’s work. According to Martin Wattenberg (@wattenberg, founder of Flowing Media), visualization is key to data conditioning: if you want to find out just how bad your data is, try plotting it. Visualization is also frequently the first step in analysis. Hilary Mason says that when she gets a new data set, she starts by making a dozen or more scatter plots, trying to get a sense of what might be interesting. Once you’ve gotten some hints at what the data might be saying, you can follow it up with more detailed analysis.

Note: any data, whether text, images, audio, or video, is stored in a computer in binary form, that is, as numbers. The “features” that AI systems extract are likewise numeric matrices that a computer can process.

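The habit described above, plotting every pair of columns to see how bad the data is, can be sketched in a few lines. This is only an illustration under stated assumptions: the column names and values are made up, and a coarse text plot stands in for a real plotting package so that the sketch needs nothing beyond the Python standard library.

```python
from itertools import combinations


def text_scatter(xs, ys, width=40, height=12):
    """Render a coarse text scatter plot of ys against xs."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    x_span = (x_hi - x_lo) or 1.0  # avoid dividing by zero on constant columns
    y_span = (y_hi - y_lo) or 1.0
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = int((x - x_lo) / x_span * (width - 1))
        row = int((y - y_lo) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 of the grid is the top
    return "\n".join("".join(line) for line in grid)


def scatter_all_pairs(columns):
    """Return {(x_name, y_name): plot_text} for every pair of columns."""
    return {
        (a, b): text_scatter(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }


# Made-up data: one height was entered in centimetres instead of metres.
columns = {
    "height_m": [1.62, 1.75, 1.80, 168.0, 1.71],
    "weight_kg": [55.0, 72.0, 81.0, 64.0, 68.0],
    "age": [23.0, 35.0, 41.0, 29.0, 52.0],
}

plots = scatter_all_pairs(columns)
for (x_name, y_name), plot in plots.items():
    print(f"{y_name} vs {x_name}")
    print(plot)
    print()
```

In the plots involving `height_m`, the mistyped value sits alone at one extreme while every other point is squashed together, which is the kind of problem that a table of numbers tends to hide.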
There are many packages for plotting and presenting data. GnuPlot is very effective; R incorporates a fairly comprehensive graphics package; Casey Reas’ and Ben Fry’s Processing is the state of the art, particularly if you need to create animations that show how things change over time. At IBM’s Many Eyes, many of the visualizations are full-fledged interactive applications.

Note: in 2007, MIT Press published the book Processing by Casey Reas and Ben Fry.

Nathan Yau’s FlowingData blog is a great place to look for creative visualizations. One of my favorites is this animation of the growth of Walmart over time. And this is one place where “art” comes in: not just the aesthetics of the visualization itself, but how you understand it. Does it look like the spread of cancer throughout a body? Or the spread of a flu virus through a population? Making data tell its story isn’t just a matter of presenting results; it involves making connections, then going back to other data sources to verify them. Does a successful retail chain spread like an epidemic, and if so, does that give us new insights into how economies work? That’s not a question we could even have asked a few years ago. There was insufficient computing power, the data was all locked up in proprietary sources, and the tools for working with the data were insufficient. It’s the kind of question we now ask routinely.

Data scientists
Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:

... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization. [3]

3. “Information Platforms as Dataspaces,” by Jeff Hammerbacher (in Beautiful Data)

Where do you find people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you’ve just spent a lot of grant money generating data, you can’t just throw the data out if it isn’t as clean as you’d like. You have to make it tell its story. You need some creativity for when the story the data is telling isn’t what you think it’s telling.

Note: Fei-Fei Li earned her bachelor’s degree in physics at Princeton University and later a PhD in electrical engineering. At a time when the AI field focused on algorithms and neglected data, she took a different path and started from the data. This passage will probably remind many readers of her career.

Scientists also know how to break large problems up into smaller problems. Patil described the process of creating the group recommendation feature at LinkedIn. It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn’s membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members’ profiles and made recommendations accordingly. It asked things like: did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn’s data scientists started looking at events that members attended. Then at books members had in their libraries. The result was a valuable data product that analyzed a huge database, but it was never conceived as such. It started small, and added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once.
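The incremental approach Patil describes can be sketched as a small rule-based recommender that starts with a single signal and grows one rule at a time. This is a hypothetical illustration, not LinkedIn’s implementation: the `Member` fields, the lookup tables, and every group name other than the Cornell example from the text are assumptions made up for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    schools: set = field(default_factory=set)
    events: set = field(default_factory=set)
    books: set = field(default_factory=set)


# Hypothetical lookup tables mapping a profile signal to a group name.
SCHOOL_GROUPS = {"Cornell": "Cornell Alumni"}
EVENT_GROUPS = {"Data Meetup 2010": "Data Meetup Attendees"}
BOOK_GROUPS = {"Beautiful Data": "Beautiful Data Readers"}


def by_school(member):
    """First version: look only at schools listed in the profile."""
    return {SCHOOL_GROUPS[s] for s in member.schools if s in SCHOOL_GROUPS}


def by_event(member):
    """Added later: events the member attended."""
    return {EVENT_GROUPS[e] for e in member.events if e in EVENT_GROUPS}


def by_book(member):
    """Added later still: books in the member's library."""
    return {BOOK_GROUPS[b] for b in member.books if b in BOOK_GROUPS}


def recommend(member, rules):
    """Union the suggestions of every rule and return them sorted."""
    groups = set()
    for rule in rules:
        groups |= rule(member)
    return sorted(groups)


alice = Member(
    "Alice",
    schools={"Cornell"},
    events={"Data Meetup 2010"},
    books={"Beautiful Data"},
)

v1 = recommend(alice, [by_school])                     # the small first version
v3 = recommend(alice, [by_school, by_event, by_book])  # after two increments
print(v1)
print(v3)
```

Each new data source is one more function in the `rules` list, so the first version can ship and be useful before the later signals exist.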

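[Figure: Cassandra jobs / companies]
It’s not easy to get a handle on jobs in data science. However, data from O’Reilly Research shows steady year-over-year growth in Hadoop and Cassandra job listings, which is a good indicator for the “data science” market as a whole. This graph shows the growth over time in Cassandra jobs and in the companies listing Cassandra positions.
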
This is the heart of what Patil calls “data jiujitsu”: using smaller auxiliary problems to solve a large, difficult problem that appears intractable. CDDB is a great example of data jiujitsu: identifying music by analyzing an audio stream directly is a very difficult problem (though not unsolvable; see midomi, for example). But the CDDB staff used data creatively to solve a much more tractable problem that gave them the same result. Computing a signature based on track lengths, and then looking up that signature in a database, is trivially simple.
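The signature-and-lookup idea can be sketched as follows. This is a simplified illustration, not the actual CDDB disc ID computation: hashing the ordered track lengths with SHA-1 is a stand-in chosen for the sketch, and the album titles and track lengths are made up.

```python
import hashlib


def signature(track_lengths_sec):
    """Hash the ordered track lengths (whole seconds) into a short hex key."""
    text = ",".join(str(int(n)) for n in track_lengths_sec)
    return hashlib.sha1(text.encode("ascii")).hexdigest()[:12]


# Stand-in for the shared database: signature -> album title.
database = {}


def register(track_lengths_sec, title):
    """Store an album under the signature of its track lengths."""
    database[signature(track_lengths_sec)] = title


def identify(track_lengths_sec):
    """Return the stored title for these track lengths, or None if unknown."""
    return database.get(signature(track_lengths_sec))


register([215, 187, 242, 301], "Hypothetical Album A")
register([198, 240, 176], "Hypothetical Album B")

known = identify([215, 187, 242, 301])    # same lengths as Album A
unknown = identify([215, 187, 242, 300])  # last track differs by one second
print(known)
print(unknown)
```

No audio is analyzed at any point; the hard recognition problem is replaced by a dictionary lookup. An exact hash like this one treats a one-second difference as a different disc, so a practical system would need some tolerance that this sketch leaves out.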
Entrepreneurship is another piece of the puzzle. Patil’s first flippant answer to “what kind of person are you looking for when you hire a data scientist?” was “someone you would start a company with.” That’s an important insight: we’re entering the era of products that are built on data. We don’t yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products. Hilary Mason came to the same conclusion. Her job as scientist at bit.ly is really to investigate the data that bit.ly is generating, and find out how to build interesting products from it. No one in the nascent data industry is trying to build the 2012 Nissan Stanza or Office 2015; they’re all trying to find new products. In addition to being physicists, mathematicians, programmers, and artists, they’re entrepreneurs.

Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: “here’s a lot of data, what can you make from it?”

The future belongs to the companies that figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it’s mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. The part of Hal Varian’s quote that nobody remembers says it all:

The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it: that’s going to be a hugely important skill in the next decades.

Data is indeed the new Intel Inside.