
Dance and Machine Learning: First Steps

Artist and coder Kyle McDonald journals his exploration of new algorithms in the context of dance.


Kyle McDonald, discrete figures, 2019

In February 2018, artist and programmer Daito Manabe wrote me an email with the subject “Dance x Math (ML)”, asking if I’d be interested in working on a new project with choreographer Mikiko and her high-tech dance troupe Elevenplay. I already had some background working in the context of dance, including 3D scanning with choreographer Lisa Parra in 2010, Reactor for Awareness in Motion (RAM) with the Yamaguchi Center for Arts and Media (YCAM) in 2013, and a project called Transcranial with Daito and media artist-choreographer Klaus Obermaier from 2014 to 2015.



I was very excited about the possibility of working with Mikiko, and again with Daito. They shared some initial inspiration for the piece. We talked about ideas emerging from the evolution of mathematics: starting with the way bodies have been used for counting since prehistory [01], through Alan Turing’s impressions of computers as an extension of human flesh, to modern attempts to categorise and measure the body with algorithms in the context of surveillance and computer vision.


After long conversations around these themes, we picked the name discrete figures to play on the multiple interpretations of both words. I focused on two consecutive scenes toward the end of the performance, which we informally called the “Debug” scene and “AI Dancer” scene. The goal for these two scenes was to explore the possibilities of training a machine learning system to generate dances in a style similar to Elevenplay’s improvisation. For the performance in Tokyo, we also added a new element to the Debug scene that includes generated dance sequences based on videos captured of the audience before the performance. In this writeup I’ll provide a teardown of the process that went into creating these scenes.


A body counting system from the Oksapmin people in Papua New Guinea.



Background


There is a long history of interactive and generative systems in the context of dance. Some of the earliest examples I know come from the “9 Evenings” series in 1966 [02]. During that event, for example, Yvonne Rainer presented Carriage Discreteness, where dancers interacted with lighting design, projection, and even automated mechanical elements.


More recently, there have been artist-engineers who have built entire toolkits or communities around dance. For example, Mark Coniglio developed Isadora, starting with tools he created in 1989 [03], and Frieder Weiss developed Kalypso and EyeCon starting around 1993 [04].


I’ve personally been very inspired by the digital art collective OpenEnded Group, who have been working on groundbreaking methods for the visualisation and augmentation of dancers since the late 1990s [05]. Many pieces by OpenEnded Group were produced in their custom development environment for experimental code and digital art called “Field”, which combines elements of node-based patching, text-based programming, and graphical editing.



AI Dancer Scene


For my work on discrete figures, I researched a few recent projects that apply techniques from deep learning to human movement. My biggest inspiration is called “chor-rnn”, by software engineer Luka Crnković-Friis and his partner Louise Crnković-Friis, a choreographer who works with creative artificial intelligence [06].


In chor-rnn, they first collected five hours of data from a single contemporary dancer, using a Kinect v2 for motion capture. Then they processed the data with a common neural network architecture called long short-term memory (LSTM). This is a recurrent neural network (RNN) architecture, which means it is designed for processing sequential data, as opposed to static data like an image. The RNN processes the motion capture data one frame at a time, and can be applied to problems like dance-style classification or dance generation. “chor-rnn” is a pun on “char-rnn”, a popular architecture used for analysing and generating text one character at a time.
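To make the frame-by-frame idea concrete, here is a minimal sketch of a recurrent next-frame predictor, written in PyTorch. This is not the chor-rnn code; the pose dimensionality, layer sizes, and loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NextFrameRNN(nn.Module):
    """Reads a sequence of pose vectors and predicts the following frame."""
    def __init__(self, frame_dim=63, hidden_dim=512, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames, state=None):
        # frames: (batch, time, frame_dim), one flattened pose per frame
        h, state = self.lstm(frames, state)
        return self.out(h), state

model = NextFrameRNN()
clip = torch.randn(4, 120, 63)              # 4 clips of 120 frames (2 s at 60 fps)
pred, _ = model(clip[:, :-1])               # predict frame t+1 from frames up to t
loss = nn.functional.mse_loss(pred, clip[:, 1:])
```

At generation time the same kind of model is run one frame at a time, feeding each predicted frame back in as the next input.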


This data is very different from most existing motion capture datasets, which are typically designed for making video games and animation. Other research on generative human motion is also typically designed toward this end. For example, “Phase-Functioned Neural Networks for Character Control” from the University of Edinburgh takes data from a joystick, the 3D terrain, and the walk cycle phase, then outputs character motion. This neural network allows a character to traverse a variety of terrain, producing appropriate, expressive locomotion according to real-time user control and the geometry of the environment.



In contrast, for discrete figures, we were more interested in things like the differences between dancers and styles, and how rhythm in music is connected to improvised dance. We collected around 2.5 hours of data with a Vicon motion capture system at 60 fps across 40 separate recording sessions. Each session was composed of one of eight dancers improvising in a different style: “robot”, “sad”, “cute”, etc. The dancers were given a 120 bpm beat to keep consistent timing. For our first exploration in this direction, we worked with computational artist Parag Mital to craft and train a network called dance2dance. This network is based on the seq2seq architecture from Google, which is similar to char-rnn in that it is a neural network architecture that can be used for sequential modelling.
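As a rough sketch of how recordings like these might be sliced into training sequences (the window length, stride, and array layout here are my own assumptions, not the project's actual preprocessing):

```python
import numpy as np

FPS = 60
WINDOW = 4 * FPS      # 4-second training clips (assumed)
STRIDE = FPS          # hop one second between clips (assumed)

def make_clips(session, window=WINDOW, stride=STRIDE):
    """Slice one session, shaped (n_frames, n_features), into overlapping clips."""
    starts = range(0, len(session) - window + 1, stride)
    return np.stack([session[s:s + window] for s in starts])

# ~2.5 hours at 60 fps is roughly 2.5 * 3600 * 60 ≈ 540,000 frames across the 40 sessions.
session = np.random.randn(3 * 60 * FPS, 63)   # a fake 3-minute session for illustration
clips = make_clips(session)                    # shape: (n_clips, 240, 63)
```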


Typically, seq2seq is used for modelling and generating language [07], but we modified it to handle motion capture data. Based on results from chor-rnn, we also used a technique called “mixture density networks” (MDN). MDN allows us to predict a probability distribution across multiple outcomes at each time step. When predicting discrete data like words, characters, or categories, it’s standard to predict a probability distribution across the possibilities. But when you are predicting a continuous value, like rotations or positions, the default is to predict a single value. MDNs give us the ability to predict multiple values, and the likelihood of each, which allows the neural network to learn a more complex structure in the data. Without MDNs the neural network either overfits and copies the training data, or it generates “average” outputs.
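Here is a minimal sketch of what a mixture density output head can look like in PyTorch, predicting a K-component Gaussian mixture over the next pose instead of a single value. It illustrates the general MDN idea, not the dance2dance implementation, and the sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class MDNHead(nn.Module):
    """Maps a hidden state to a Gaussian mixture over the next pose vector."""
    def __init__(self, hidden_dim=512, out_dim=63, n_components=8):
        super().__init__()
        self.out_dim, self.K = out_dim, n_components
        self.logits = nn.Linear(hidden_dim, n_components)               # mixture weights
        self.mu = nn.Linear(hidden_dim, n_components * out_dim)         # component means
        self.log_sigma = nn.Linear(hidden_dim, n_components * out_dim)  # component spreads

    def forward(self, h):
        mu = self.mu(h).view(-1, self.K, self.out_dim)
        sigma = self.log_sigma(h).view(-1, self.K, self.out_dim).exp()
        mix = D.Categorical(logits=self.logits(h))
        comp = D.Independent(D.Normal(mu, sigma), 1)
        return D.MixtureSameFamily(mix, comp)

head = MDNHead()
h = torch.randn(32, 512)              # hidden states from the recurrent model
dist = head(h)
target = torch.randn(32, 63)          # the true next pose
loss = -dist.log_prob(target).mean()  # train by maximising likelihood
sample = dist.sample()                # at generation time, draw the next pose
```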


One big technical question we had to address while working on this was how to represent dance. By default, the data from the motion capture system is stored in a format called Biovision hierarchical data, or BVH, which provides a skeletal structure with fixed-length limbs and a set of position and rotation offsets for every frame. The data is mostly encoded using rotations, with the exception of the hip position, which is used for representing the overall position of the dancer in the world. If we were able to generate rotation data with the neural net, then we could generate new BVH files and use them to transform a rigged 3D model of a virtual dancer. chor-rnn uses 3D position data, which means it is impossible to distinguish between something like an outstretched hand that is facing palm-up vs palm-down, or whether the dancer’s head is facing left vs right. There are some other decisions to make about how to represent human motion:


  • The data: position or rotation.
  • The representation: for position, there are cartesian and spherical coordinates. For rotation, there are rotation matrices, quaternions, Euler angles, and axis-angle (a small conversion sketch follows this list).
  • The temporal relationship: temporally absolute data, or difference relative to the previous frame.
  • The spatial relationship: spatially absolute data, or difference relative to the parent joint.
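As a small illustration of the rotation options above, SciPy can convert between these representations; the joint angles and rotation order below are made up for the example, and real BVH files specify their own channel order.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

euler_deg = np.array([30.0, -10.0, 45.0])            # one joint's Euler angles (illustrative)
rot = R.from_euler('ZXY', euler_deg, degrees=True)   # rotation order is file-specific

quat = rot.as_quat()          # quaternion: 4 numbers (x, y, z, w)
matrix = rot.as_matrix()      # rotation matrix: 3x3, 9 numbers
axis_angle = rot.as_rotvec()  # axis-angle: 3 numbers, axis direction scaled by angle in radians
```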

Each of these has different benefits and drawbacks. For example, using temporally relative data “centres” the data, making it easier to model (this approach is used by David Ha for sketch-rnn), but when reconstructing the absolute position from generated output, it can slowly drift. Using Euler angles can help decrease the number of variables to model, but angles wrap around in a way that is hard to model with neural networks. A similar problem is encountered when using neural networks to model the phase of audio signals.
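One common workaround for the wrap-around problem (not necessarily what we used) is to feed the network each angle as a sine/cosine pair, so that 359° and 1° map to nearby points:

```python
import numpy as np

def encode_angle(theta):
    """Represent angles (radians) as (sin, cos) pairs, which are continuous across the wrap."""
    return np.stack([np.sin(theta), np.cos(theta)], axis=-1)

def decode_angle(pair):
    """Recover the angle from a (sin, cos) pair; tolerates slightly off-circle outputs."""
    return np.arctan2(pair[..., 0], pair[..., 1])

print(np.degrees(decode_angle(encode_angle(np.radians(359.0)))))  # -1.0, i.e. the same as 359°
```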


In our case, we decided to use temporally and spatially absolute quaternions. Initially we had some problems with wrap-around and quaternion flipping, because quaternions have two equivalent representations for any orientation, but it is possible to constrain quaternions to a single representation.
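A sketch of the kind of constraint this involves: since q and −q describe the same orientation, each frame's quaternion can be flipped to stay on the same hemisphere as the previous frame, removing sudden sign jumps from the training data. This is a generic version of the fix, not our exact code.

```python
import numpy as np

def canonicalise(quats):
    """quats: (n_frames, 4). Flip any quaternion whose dot product with the
    previous frame is negative, so the sequence never jumps between q and -q."""
    out = quats.copy()
    for t in range(1, len(out)):
        if np.dot(out[t], out[t - 1]) < 0:
            out[t] = -out[t]
    return out
```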


Before training the dance2dance network, I tried some other experiments on the data: for example, training a variational autoencoder (VAE) to “compress” each frame of data. In theory, if it’s possible to compress each frame, then it is possible to generate in that compressed space instead of worrying about modelling the original space. When I tried to generate using a 3-layer LSTM trained on the VAE-processed data, the results were incredibly “shaky”. (I assume this is because I did not incorporate any requirement of temporal smoothness, and the VAE learned a very piecemeal latent space capable of reconstructing individual frames instead of learning how to expressively interpolate.) In the video below, the original data is visualised on the left and the VAE-“compressed” data is on the right. The left figure moves smoothly, and the right copies the left but with less enthusiasm.
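For reference, a per-frame VAE of the kind described might look roughly like this in PyTorch; note that, as in the experiment above, nothing here encourages temporal smoothness, and the sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameVAE(nn.Module):
    """Compresses a single pose vector to a small latent code and reconstructs it."""
    def __init__(self, frame_dim=63, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(frame_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, frame_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    reconstruction = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + kl
```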



After training the dance2dance network for a few days, we started to get output that looked similar to some of our input data. The biggest difference across all these experiments was that the hips were fixed in place, making it look sort of like the generated dancer was flailing around on a bicycle seat. The hips were fixed because we were only modelling the rotations and didn’t model the hip position offset. As the deadline for the performance drew near, we decided to stop the training and work with the model we had. The network was generating a sort of not-quite-human movement that was still somehow reminiscent of the original motion, and it felt appropriate for the feeling we were trying to create in the performance.




During the performance, the real dancer from Elevenplay, MARUYAMA Masako, or Maru for short, began the scene by exploring the space around the AI dancer, keeping her distance with a mixture of curiosity and suspicion. Eventually, Maru attempted to imitate the AI dancer. For me, this was one of the most exciting moments: human movement had passed through a neural network and was once again embodied by a dancer. The generated motion was then interpolated with the choreography to produce a slowly-evolving duet between Maru and the AI dancer. During this process, the rigged 3D model “came to life” and changed from a silvery 3D blob to a textured dancer. It represented the way that life emerges when creative expression is shared between people; the way that sharing can complete something otherwise unfinished. As the scene ended, the AI dancer attempted to exit the stage, but Maru backed up in the same direction with her palm outstretched towards the AI dancer. The AI dancer transformed back into the silvery blob and was left writhing alone in its unfinished state, without a real body or any human spirit to complete it.



Debug Scene


The Debug scene preceded the AI Dancer scene and acted as an abstract introduction. For the Debug scene, I compiled a variety of different data from the training and generation process, and presented it as a kind of landscape for exploring. There were four main elements to the Debug scene, and it was followed by a collection of data captured from the audience before the performance.



First, in the centre is the generated dancer, including the skeleton and rigged 3D model. Second, covering the generated dancer is a set of rotating cubes, representing the rotations of most of the joints in the model. The third element, on the left and right, is a pair of 3D point clouds based on the generated data. Each point in the point clouds corresponds to a single frame of generated data. One point cloud represents the raw rotation data, and the other represents the state of the neural network at that moment in time. The point clouds are generated using a technique called Uniform Manifold Approximation and Projection, or UMAP, by Leland McInnes.


Plotting 2D or 3D points is easy, but when you have more than 3 dimensions it can be hard to visualise. UMAP helped with this problem by taking a large number of dimensions (like all the rotation values of a single frame of dance data) and creating a set of 3D points that has a similar structure. This means points that are close in the high-dimensional space should be close in the low-dimensional 3D space. Another popular algorithm for this is  t-SNE .
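Using the umap-learn library, reducing a stack of high-dimensional frames to a 3D point cloud looks roughly like this (the file name and parameter values are placeholders):

```python
import numpy as np
import umap  # pip install umap-learn

frames = np.load('generated_rotations.npy')            # (n_frames, n_features), placeholder file
reducer = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1)
points = reducer.fit_transform(frames)                 # (n_frames, 3): nearby frames stay nearby
```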




The fourth and final element is the large rotating cube in the background, made of black and white squares. This is a reference to a traditional technique for visualising the state of neural networks, called a Hinton diagram [08]. In these diagrams, black squares represent negative numbers and white squares represent positive numbers, and their size corresponds to the magnitude of the value. Historically, these diagrams were helpful for quickly checking and comparing the internal state of a neural network by hand. In this case, we are visualising the state of the dance2dance network that is generating the motion.
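A Hinton diagram is straightforward to draw; the sketch below, adapted from the standard matplotlib approach, plots white squares for positive values and black squares for negative ones, with area proportional to magnitude.

```python
import numpy as np
import matplotlib.pyplot as plt

def hinton(matrix, ax=None):
    """Draw a Hinton diagram of a weight or activation matrix."""
    ax = ax or plt.gca()
    ax.set_facecolor('gray')
    ax.set_aspect('equal')
    max_weight = np.abs(matrix).max()
    for (y, x), w in np.ndenumerate(matrix):
        colour = 'white' if w > 0 else 'black'
        size = np.sqrt(abs(w) / max_weight)
        ax.add_patch(plt.Rectangle([x - size / 2, y - size / 2], size, size,
                                   facecolor=colour, edgecolor=colour))
    ax.autoscale_view()
    ax.invert_yaxis()

hinton(np.random.randn(16, 16))   # e.g. a slice of the network's weights or hidden state
plt.show()
```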


Dance and machine


The ending sequence of the Debug scene was based on data collected just before each performance. The audience was asked to dance for one minute in front of a black background, one person at a time. We showed a sample dance for inspiration and displayed realtime pose-tracking results to help the audience understand what data was being collected. This capture booth was built by Asai Yuta and Mori Kyōhei, and the sample dance featured a rigged model of Maru rendered by Rhizomatiks.


For each audience member, we uploaded their dance video to a remote machine that analysed their motion using the open-source pose detection library OpenPose. On performance days we kept 16 p2.xlarge AWS instances alive and ready to ingest this data, automated by the programmer 2bit.
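The per-video analysis is conceptually simple: run the OpenPose binary over the video, then read back one JSON file of keypoints per frame. The sketch below shows that shape of pipeline; the binary path and file names are placeholders, and our actual automation on AWS was more involved.

```python
import glob
import json
import subprocess

import numpy as np

# Run OpenPose headless over one audience video, writing per-frame keypoint JSON files.
subprocess.run([
    './build/examples/openpose/openpose.bin',
    '--video', 'audience_member.mp4',
    '--write_json', 'keypoints/',
    '--display', '0',
    '--render_pose', '0',
], check=True)

poses = []
for path in sorted(glob.glob('keypoints/*_keypoints.json')):
    with open(path) as f:
        people = json.load(f)['people']
    if people:
        # Each detected person is a flat list of (x, y, confidence) triples, one per body joint.
        poses.append(np.array(people[0]['pose_keypoints_2d']).reshape(-1, 3))
```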


After analysing their motion, we trained an architecture called pix2pixHD to generate images from the corresponding poses. While pix2pixHD is typically available under a non-commercial license, NVIDIA granted us an exception for this performance. Once pix2pixHD was trained, we could then synthesise “fake” dance videos featuring the same person. This process was heavily inspired by the AI dance project “Everybody Dance Now” by Caroline Chan et al.



In our case, we synthesised the dance  during  the training process. This means the first images in the sequence look blurry and unclear, but by the end of the scene they start to resolve into more recognisable features. During the first half of this section we show an intermittent overlay of the generated dancer mesh, and during the second half we show brief overlays of the best-matching frame from the original video recording. The pose-matching code was developed by Asai.


While most of discrete figures runs in realtime, the Debug scene is pre-rendered in openFrameworks and exported as a video file to reduce the possibility of something going wrong in the middle of the show. Because the video is re-rendered for every show, a unique kind of time management was required that allowed us to include up to 15 audience members in each performance (a rough timing check follows the list below):


  • The doors for the show open one hour before each performance.
  • Each audience member records for one minute with some time before and after for stepping up and walking away.
  • We train and render using pix2pixHD for 15 minutes per person (including the video generation and file transfer from AWS to the venue).
  • It takes 12 minutes to render the final video from the generated videos.
  • We must hand off the video for final checks 15 minutes before the lights dim (as the audience is being seated).
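A back-of-envelope reading of those numbers (my own arithmetic, not the production notes): because the per-person training runs in parallel across the AWS instances, the binding constraint is when the last recording can finish.

```python
doors_open_before_show = 60   # minutes before the lights dim
train_and_transfer = 15       # per person, in parallel across the 16 instances
final_render = 12             # assembling the scene video from the generated clips
handoff_buffer = 15           # final checks while the audience is seated

last_recording_deadline = train_and_transfer + final_render + handoff_buffer  # 42 min before show
recording_window = doors_open_before_show - last_recording_deadline           # ~18 minutes

print(recording_window)  # just enough for roughly 15 one-minute recordings plus walk-up time
```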

While Maru had a chance to experience the process of her movement data being reimagined by a machine, this final section of the Debug scene also gave the audience a chance to have the same feeling. It followed a recurring theme throughout the entire performance: seeing your humanity reflected in the machine, and vice versa.


Generated images based on video of audience members using pix2pixHD.



Future


Next, we will be exploring other data representations, other network architectures, and the possibility of conditional generation (generating dances in a specific style or from a specific dancer, or to a specific beat) and classification (determining each of these attributes from input data, for example, following the rhythm of a dancer with generative music). While the training process for these architectures can take a long time, once they are trained, the evaluation can happen in realtime, opening up the possibility of using them in interactive contexts.


discrete figures is almost an hour long, and this article only describes a small piece of the performance. This article was first published on Kyle McDonald’s Medium blog and edited for clarity.


  • 01.

    In his work “Performing the Smart Nation”, performance and new media artist Teow Yue Han also refers to dance theorist Rudolf Laban’s work in the early 1900s, which laid the foundation for dance notation and movement analysis. Laban’s notation system was later developed using abstract symbols to describe the direction, duration and dynamic quality of the body’s movements. Read the interview with Teow here: https://www.so-far.online/performing-the-smart-nation-the-evolving-practice-of-teow-yue-han/

  • 02.

    9 Evenings: Theatre and Engineering was a series of 10 performances in 1966 initiated by artist Robert Rauschenberg and electrical engineer Billy Klüver. Multidisciplinary artists including John Cage, Yvonne Rainer and Robert Whitman, and engineers from Bell Laboratories including John Pierce and Béla Julesz, collaborated in a hybrid of new technologies, dance and avant-garde theatre. The event was later expanded to a series of projects that would become known as Experiments in Art and Technology (E.A.T.)

  • 03.

    Named after modern dance pioneer Isadora Duncan, Isadora is an interactive software that can manipulate video and sound in real time, enabling performance artists, designers and multi-media creatives to craft unique environments.

  • 04.

    EyeCon and Kalypso are video motion sensor programmes especially designed around dance, music and computer art.

  • 05.

    OpenEndedGroup comprises two digital artists, Marc Downie and Paul Kaiser, who have been collaborating across a range of forms and disciplines since 2001. Notable collaborators include Merce Cunningham, Bill T. Jones and Trisha Brown, amongst others.

  • 06.

    Luka Crnković-Friis is also an entrepreneur and CEO of Swedish startup, Peltarion, which builds neural networks for corporate clients.

  • 07.

    seq2seq is a general-purpose encoder-decoder framework for the open-source software library TensorFlow that can be used for machine translation, text summarisation, conversational modelling, image captioning, and more. After being used in Google Translate, it was released on GitHub by Google in 2016. Read more: https://google.github.io/seq2seq/

Artists and Contributors


Kyle McDonald

Kyle McDonald is an artist working with code. He is a contributor to open source arts-engineering toolkits like openFrameworks, and builds tools that allow artists to use new algorithms in creative ways. McDonald creatively subverts networked communication and computation, explores glitch and systemic bias, and extends these concepts to a reversal of everything from identity to relationships. Kyle has been an adjunct professor at NYU's ITP, a member of F.A.T. Lab, and an artist in residence at the STUDIO for Creative Inquiry at Carnegie Mellon, as well as at YCAM in Japan. His work is shown at exhibitions and festivals around the world. McDonald is based in Los Angeles, USA.