Alexis Rolland, Director of the Ubisoft China AI & Data Lab: Building the Metaverse Takes Six Foundational Pillars

In the Metaverse, everyone and everything will have a virtual representation. Through it, people can live experiences similar to, or even beyond, real life: playing games, going to the movies, creating art, or shopping.

Unlike today's social media avatars, a virtual avatar establishes a person's unique identity in the virtual world. By reproducing facial features, emotional expressions, gestures, and changes in posture, it heightens the sense of interaction and realism.

In other words, the virtual avatar is humanity's passport to the virtual world and its identity within it.

Clearly, customized game characters and user-generated content are important pillars of the Metaverse. Creating virtual characters, however, is a complex process with many challenges to overcome.

Ubisoft is a leading developer, publisher, and distributor of interactive entertainment games and services. Since establishing a studio in China in 1996, it has been at the forefront of the Chinese game industry.

Today, Ubisoft China runs two studios, in Shanghai and Chengdu, with more than 1,000 professionals from China and abroad working in game production, graphic design, animation, programming, artificial intelligence, audio, testing, and data management.

Ubisoft has spent years exploring how to build virtual avatars and has assembled an industry-leading set of technologies for doing so.

At the Gravity Singularity Metaverse Summit, Alexis Rolland, Director of the Ubisoft China AI & Data Lab, presented Ubisoft's view of the Metaverse, along with the challenges his team encountered while creating virtual avatars and the solutions they developed.

The following is a lightly edited transcript of Alexis Rolland's speech:
Thank you for the introduction; it is an honor to be here. My name is Alexis Rolland, and I am the Director of the Ubisoft China AI & Data Lab, a technology team with members in both our Shanghai and Chengdu studios. We do R&D and build tools that empower game production teams with machine learning.

I assume most of you know about Ubisoft, or at least have heard of us. We are a major video game developer, but just in case, I have prepared a short video to remind you of the games we make.

Ubisoft was actually one of the first foreign video game developers to enter China, opening its first studio in Shanghai as early as 1996. A second studio followed in Chengdu in 2008.

Today, the two studios combined employ more than 1,000 people, which makes Ubisoft China the third biggest creative force in the company. We work on its most famous franchises, including Rainbow Six: Siege, Assassin's Creed, Far Cry, Just Dance, and Rabbids, for which we are releasing a new game in China this month.

Today's presentation covers four parts. First, an introduction to the Metaverse, or at least how we define it at Ubisoft. Then I will address three common challenges related to the creation of virtual avatars: facial animation, body animation, and animation blending.
So let's start with the Metaverse.

At Ubisoft, we define the Metaverse as an enhanced virtual world parallel to the real world, where players can use a personalized avatar to do almost everything they could do in real life. That includes playing video games, but also going to concerts or the movies, and even creating or shopping.

We think the Metaverse relies on six foundational pillars.

The first is socialization. First and foremost, the Metaverse is a social hub where players can interact through engaging relationships that complement, or even replace, real-life socialization.

The second pillar is persistency. The Metaverse keeps going after players disconnect; it does not depend on any player's presence and lives on without them.

The third pillar is user-generated content, and digital content creation in general. In the Metaverse, players should be able to interact with and contribute to the digital universe thanks to easily accessible tools. The Metaverse thus blurs the line between creators and players.

The fourth pillar is the convergence of media. The Metaverse is also a place where different media industries coexist, and players are invited to live cross-media experiences spanning art, music, and movies.

The fifth pillar is an integrated, functional economy. In the Metaverse, players should have the opportunity to earn money, or to acquire skills recognized by the system and valued by other players. This is where technologies like blockchain and NFTs can play a big role.

Finally, scalability. The Metaverse depends on technology that scales, so that a great number of players can congregate on a single server and share moments together, instead of playing scattered across multiple servers.
Looking at this big picture, you may wonder where artificial intelligence fits in. Here I am not talking about AI in the sense of game bots, but about machine learning and deep learning techniques.

AI is a little like electricity: it has the potential to revolutionize all of these domains. In today's presentation, though, I would like to focus on digital content creation.

In particular, we see a big trend among VTubers and content creators: they are preparing themselves for the Metaverse. They equip themselves with relatively expensive hardware (in these pictures you can see motion capture suits, headsets, and so on) and create their own digital avatars, their alter egos.

Beyond the hardware investment, though, achieving this still demands considerable skill in 3D and animation. Facial animation in particular is a challenge.

For a virtual avatar to look good and realistic, it needs a perfect match between the speech, the emotion it carries, and the animation of the face, including the eyes, eyebrows, lips, and so on.

In the context of video games, facial animation can also be quite expensive, particularly when you localize games into different languages. At Ubisoft, we localize the voices in our games into 9 to 10 languages, and the same line spoken in different languages has a different duration.

German, for example, is famous for its very long sentences and words, while English tends to be shorter, so the mouth animation has to stay perfectly in sync with the speech of each language. Creating all of those facial animations manually would be expensive.

Our teams at Ubisoft, in particular Ubisoft La Forge, have been working on a solution to this problem. They trained a convolutional neural network on speech data: the network takes as input an audio file containing dialogue lines and outputs a sequence of phonemes.

For those who are not into linguistics, phonemes are units of sound, each of which can be mapped to a shape of the mouth.

This sequence of phonemes is then converted into lip and mouth animation, known as f-curves to those familiar with the domain.
We have talked about the face, but what about the body? There is a very hot research topic in academia right now called pose estimation. It consists in trying to generate the 3D coordinates of the human body, the different joints of the skeleton, from 2D images or videos, and it is a genuinely hard problem. What you see here is actually not from Ubisoft; it is a state-of-the-art paper published by the Max Planck Institute in Germany.

In their case, they developed a technique that predicts not only the coordinates of the body but also the generated mesh and the shape of the body; we call this body pose and shape estimation. This kind of research is very inspiring for teams like ours in the video game industry.
In our case at Ubisoft China, we work a lot on animals, on wildlife. We have a long history with the Far Cry brand, for which we developed its most iconic animals. The research I just mentioned is already challenging for humans, because motion capture data is hard to acquire; now think about wild animals like a bear, an elephant, or a tiger, where capturing such training data is harder still. So our idea was to leverage the previous work done by our teams.

Over the last eight years, our animators created many animal animations by hand, what we call keyframe animation. We use that work to generate training data, aiming for results similar to what you saw for humans. The idea is to build a pipeline that takes a 2D video and a template skeleton as input and outputs the 3D coordinates of the animal's skeleton.

We eventually built that pipeline, a tool we call ZooBuilder. You can see its different components here. It takes a video as input, converts it into a sequence of images, and feeds those images to a first machine learning model that locates the animal in each image.

We then feed the sequence of cropped images to a second machine learning model, retrained on our synthetic data, the animal animation data. This second model outputs the 2D coordinates of the skeleton in each image. Those 2D coordinates go to a third model, which converts them to 3D. Finally, the 3D coordinates are applied to a 3D model.
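In outline, the pipeline chains three models. The sketch below reconstructs that orchestration in Python so the data flow is explicit; the stage functions are hypothetical stubs standing in for ZooBuilder's actual models.

```python
import cv2  # OpenCV, used here only to decode the video into frames

def detect_animal(frame):
    """Stage 1 stub: return a bounding box (x, y, w, h) around the animal."""
    h, w = frame.shape[:2]
    return 0, 0, w, h  # dummy box covering the whole frame

def estimate_2d_keypoints(crop):
    """Stage 2 stub: return per-joint 2D coordinates for the cropped image."""
    return [(0.0, 0.0)] * 20  # dummy 20-joint skeleton

def lift_to_3d(keypoints_2d_sequence, template_skeleton):
    """Stage 3 stub: convert the 2D keypoint sequence to 3D joint coordinates."""
    return [[(x, y, 0.0) for x, y in frame] for frame in keypoints_2d_sequence]

def zoobuilder_style_pipeline(video_path, template_skeleton):
    """Video in, 3D skeleton animation out (hypothetical reconstruction)."""
    capture = cv2.VideoCapture(video_path)
    keypoints_2d = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break  # end of video
        x, y, w, h = detect_animal(frame)       # model 1: locate the animal
        crop = frame[y:y + h, x:x + w]          # crop the image around it
        keypoints_2d.append(estimate_2d_keypoints(crop))  # model 2: 2D skeleton
    capture.release()
    return lift_to_3d(keypoints_2d, template_skeleton)    # model 3: 2D -> 3D
```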
The results are promising, but to be completely transparent, it is not used in production yet; it remains very much a research topic. The animation is still a little imperfect, and making the approach scale to many different animals is a challenge.

Still, this kind of technique can definitely help create more animation clips from 2D videos, rather than relying on motion capture infrastructure or motion capture hardware, whether for animals or humans.

Creating those animation clips, though, is only halfway to animating virtual characters in the Metaverse. The clips need to be integrated and combined together, which is why I also want to talk about animation blending.

Let me first explain how it is done traditionally. An animator develops what we call an animation graph, or animation tree, composed of different leaves corresponding to different animation clips. Based on player input, or on automated input, the virtual character moves through that graph and plays the corresponding animations.

For this to look nice and smooth, there is a requirement: the last frame of the first animation clip should match the first frame of the second. The end of a walk cycle, say, should line up with the beginning of the run cycle or of the jump animation. When the frames do not match, the animation looks jittery and unnatural.
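That constraint is easy to express in code. Below is a toy Python sketch of an animation graph with a frame-match check; the two-number poses and the tolerance value are invented for illustration.

```python
import math

# A clip is a list of poses; a pose is a tuple of joint rotations (toy model).
CLIPS = {
    "walk": [(0.0, 0.1), (0.2, 0.3), (0.1, 0.2)],
    "run":  [(0.1, 0.2), (0.5, 0.6), (0.3, 0.4)],
}

# The animation graph: which clips may transition into which.
GRAPH = {"walk": ["run"], "run": ["walk"]}

def pose_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transition_is_smooth(src, dst, tolerance=0.05):
    """The last frame of `src` must (approximately) match the first of `dst`."""
    return pose_distance(CLIPS[src][-1], CLIPS[dst][0]) <= tolerance

for src, targets in GRAPH.items():
    for dst in targets:
        if not transition_is_smooth(src, dst):
            print(f"{src} -> {dst} will pop: add a transition clip between them")
```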
Here is an example of how that looks: a character walking on a plane. As you can see, the animation is a little jittery; it jumps from time to time, which breaks the immersion.

One technique to solve this is to add more animations, transition animations that fill the gaps between clips. But this can become very complex and difficult to manage: the more animations you add, the harder the graph is to maintain.

Another technique for this challenge is what we call motion matching. You take all those animation clips and put them in an in-memory database; then a search algorithm, driven by player input such as the character's feet positions and trajectory, finds the best-matching animation frame and hands it to the game engine. This works fairly well.
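At its core, that search is a nearest-neighbor lookup over per-frame pose features. Here is a minimal NumPy sketch; the three-number feature (two foot heights and a forward speed) is an invented stand-in for the richer features a production system would use.

```python
import numpy as np

# Feature per frame: (left_foot_height, right_foot_height, forward_speed).
database = np.array([
    [0.00, 0.10, 1.2],   # walk, left foot planted
    [0.10, 0.00, 1.3],   # walk, right foot planted
    [0.00, 0.25, 3.0],   # run, left foot planted
    [0.25, 0.00, 3.1],   # run, right foot planted
])

def best_matching_frame(query):
    """Brute-force nearest neighbor: cost grows with the database size."""
    distances = np.linalg.norm(database - query, axis=1)
    return int(np.argmin(distances))

# The player pushes the stick forward: feet near the ground, high speed.
print(best_matching_frame(np.array([0.05, 0.05, 2.9])))  # -> 2, a run frame
```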
This is the same example as before, but with motion matching activated; you can see the character's animation is a lot smoother. The challenge is that this technique scales linearly with the number of animations in the database: if you want more diverse motion, you need to add more animation, which drives up the memory and compute requirements.

Our teams at Ubisoft worked on a new approach inspired by motion matching, which we call learned motion matching. Here we replace the search with a neural network trained to output those animation frames based on the character's feet positions and other inputs.
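Conceptually, the lookup table becomes a small network distilled from the database. The following PyTorch sketch trains a toy regressor from query features to the pose that classic motion matching would have returned; the sizes and random training data are placeholders, nothing like production scale.

```python
import torch
import torch.nn as nn

FEATURES, POSE_DIM = 3, 24  # invented sizes: query features in, pose out

# A small MLP stands in for the memory-hungry animation database.
matcher = nn.Sequential(
    nn.Linear(FEATURES, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, POSE_DIM),
)

# Training pairs: (query feature, pose that classic motion matching returned).
queries = torch.rand(1024, FEATURES)
poses = torch.rand(1024, POSE_DIM)

optimizer = torch.optim.Adam(matcher.parameters(), lr=1e-3)
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(matcher(queries), poses)
    loss.backward()
    optimizer.step()

# At runtime the network, a few hundred KB of weights, replaces the database.
next_pose = matcher(torch.rand(1, FEATURES))
```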
The good thing about this technique is that it works as well as traditional motion matching. Here is a comparison: on the right, traditional motion matching; on the left, learned motion matching. The quality of the animation is preserved, and it stays quite smooth. In terms of memory, however, learned motion matching is far less demanding, almost ten times less than the traditional method.

The good news is that it works just as well for animals, for quadrupeds. Here you see a bear walking on uneven terrain. The animation was generated with learned motion matching and is perfectly smooth, which is great for our use cases at Ubisoft China.

That is all I wanted to share about animation-related challenges and applications of AI. Let's wrap it up together.

First, we saw an automatic pipeline that generates lip animation from speech. This can be a lot more convenient than facial capture with videos, headsets, and so on.

Second, we saw emerging and promising techniques for generating body animation from videos, again to get rid of all that expensive hardware.

Finally, we introduced learned motion matching, which uses machine learning to improve on existing animation programming and animation blending techniques.

Frankly, with machine learning in the field of animation, we are only scratching the surface of what is possible. But by putting these three elements together, we start to see the future of the animation pipeline for virtual characters: more and more performant, and more and more accessible. In the long run, that will streamline the work of artists and, hopefully, benefit players as well.

That is it for my presentation. Thank you for listening.