Motivation and prior work
Automatic music generation dates back more than half a century. A prominent approach is to generate music symbolically in the form of a piano roll, which specifies the timing, pitch, velocity, and instrument of each note to be played. This has led to impressive results like producing Bach chorales, polyphonic music with multiple instruments, and minute-long musical pieces.
But symbolic generators have limitations: they cannot capture human voices or many of the subtler timbres, dynamics, and expressive qualities that are essential to music. A different approach is to model music directly as raw audio. Generating music at the audio level is challenging because the sequences are very long: a typical 4-minute song at CD quality (44.1 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 operated on 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high-level semantics of music, a model has to deal with extremely long-range dependencies.
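That timestep count is easy to verify with a quick back-of-the-envelope calculation (using 44.1 kHz CD audio):

```python
# Sample-count arithmetic for a 4-minute song at CD quality.
sample_rate = 44_100        # Hz (CD quality)
duration_s = 4 * 60         # 4 minutes in seconds
timesteps = sample_rate * duration_s
print(f"{timesteps:,}")     # 10,584,000 raw audio samples
```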
One way of addressing the long input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.
We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of MIDI data. Now, in raw audio, our models must learn to handle high diversity as well as very long-range structure, and the raw audio domain is particularly unforgiving of errors in short-, medium-, or long-term timing.
Compressing music to discrete codes
Jukebox’s autoencoder model compresses audio to a discrete space, using a quantization-based approach called VQ-VAE. Hierarchical VQ-VAEs can generate short instrumental pieces from a few sets of instruments; however, they suffer from hierarchy collapse due to the use of successive encoders coupled with autoregressive decoders. A simplified variant, VQ-VAE-2, avoids these issues by using feedforward encoders and decoders only, and shows impressive results at generating high-fidelity images.
We draw inspiration from VQ-VAE-2 and apply their approach to music. We modify their architecture as follows:
- To alleviate codebook collapse common to VQ-VAE models, we use random restarts where we randomly reset a codebook vector to one of the encoded hidden states whenever its usage falls below a threshold.
- To maximize the use of the upper levels, we use separate decoders and independently reconstruct the input from the codes of each level.
- To allow the model to reconstruct higher frequencies easily, we add a spectral loss that penalizes the norm of the difference of input and reconstructed spectrograms.
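The random-restart trick in the first modification above can be sketched as follows (an illustrative NumPy sketch; the function name, the usage counter, and the threshold are assumptions, not Jukebox's actual implementation):

```python
import numpy as np

def restart_dead_codes(codebook, usage, encoder_outputs, threshold=1.0, rng=None):
    """Reset rarely used codebook vectors to random encoder hidden states.

    codebook:        (K, D) array of code vectors
    usage:           (K,) running usage count per code (e.g. an EMA)
    encoder_outputs: (N, D) batch of encoded hidden states
    """
    rng = rng or np.random.default_rng()
    dead = usage < threshold
    n_dead = int(dead.sum())
    if n_dead:
        # Re-initialize each dead code with a randomly chosen encoded state.
        idx = rng.choice(len(encoder_outputs), size=n_dead)
        codebook[dead] = encoder_outputs[idx]
    return codebook
```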
We use three levels in our VQ-VAE, shown below, which compress the 44kHz raw audio by 8x, 32x, and 128x, respectively, with a codebook size of 2048 for each level. This downsampling loses much of the audio detail, and sounds noticeably noisy as we go further down the levels. However, it retains essential information about the pitch, timbre, and volume of the audio.
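The hop lengths and codebook size above translate into concrete code rates; this sketch derives them (bits per code is the base-2 log of the codebook size; the kbit/s figures are our own arithmetic, not official numbers):

```python
import math

sample_rate = 44_100
codebook_size = 2048
bits_per_code = math.log2(codebook_size)      # 11 bits per code

for hop in (8, 32, 128):                      # bottom, middle, top levels
    codes_per_s = sample_rate / hop
    print(f"{hop:>3}x: {codes_per_s:8.1f} codes/s "
          f"≈ {codes_per_s * bits_per_code / 1000:6.1f} kbit/s")
```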
Generating codes using transformers
Next, we train the prior models, whose goal is to learn the distribution of music codes encoded by the VQ-VAE and to generate music in this compressed discrete space. Like the VQ-VAE, we have three levels of priors: a top-level prior that generates the most compressed codes, and two upsampling priors that generate less compressed codes conditioned on the codes above them.
The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, significantly improving the audio quality.
We train these as autoregressive models using a simplified variant of Sparse Transformers. Each of these models has 72 layers of factorized self-attention on a context of 8192 codes, which corresponds to approximately 24 seconds, 6 seconds, and 1.5 seconds of raw audio at the top, middle and bottom levels, respectively.
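The quoted context durations follow directly from the hop lengths of each level (a quick sanity check):

```python
sample_rate = 44_100
context_codes = 8192

# Each code covers `hop` raw-audio samples at its level.
for name, hop in [("top", 128), ("middle", 32), ("bottom", 8)]:
    seconds = context_codes * hop / sample_rate
    print(f"{name:>6} level: {seconds:5.2f} s of audio per 8192-code context")
```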
Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the VQ-VAE decoder to sample novel songs.
To train this model, we crawled the web to curate a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki. The metadata includes the artist, album, genre, and year of each song, along with common moods or playlist keywords associated with it. We train on 32-bit, 44.1 kHz raw audio, and perform data augmentation by randomly downmixing the right and left channels to produce mono audio.
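The stereo-to-mono augmentation can be sketched as a random weighting of the two channels (illustrative only; the exact mixing scheme used is an assumption):

```python
import numpy as np

def random_downmix(stereo, rng=None):
    """Randomly downmix a (T, 2) stereo signal to (T,) mono.

    Draws a random left/right balance on each call, so the model sees
    a different mono mix of the same song every time it is sampled.
    """
    rng = rng or np.random.default_rng()
    alpha = rng.uniform(0.0, 1.0)           # random channel weight
    return alpha * stereo[:, 0] + (1 - alpha) * stereo[:, 1]
```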
Artist and genre conditioning
The top-level transformer is trained on the task of predicting compressed audio tokens. We can provide additional information, such as the artist and genre for each song. This has two advantages: first, it reduces the entropy of the audio prediction, so the model is able to achieve better quality in any particular style; second, at generation time, we are able to steer the model to generate in a style of our choosing.
The t-SNE below shows how the model learns, in an unsupervised way, to cluster similar artists and genres close together, and also makes some surprising associations, like Jennifer Lopez being so close to Dolly Parton!
In addition to conditioning on artist and genre, we can provide more context at training time by conditioning the model on the lyrics for a song. A significant challenge is the lack of a well-aligned dataset: we only have lyrics at a song level without alignment to the music, and thus for a given chunk of audio we don’t know precisely which portion of the lyrics (if any) appear. We also may have song versions that don’t match the lyric versions, as might occur if a given song is performed by several different artists in slightly different ways. Additionally, singers frequently repeat phrases, or otherwise vary the lyrics, in ways that are not always captured in the written lyrics.
To match audio portions to their corresponding lyrics, we begin with a simple heuristic that aligns the characters of the lyrics to linearly span the duration of each song, and pass a fixed-size window of characters centered around the current segment during training. While this simple strategy of linear alignment worked surprisingly well, we found that it fails for certain genres with fast lyrics, such as hip hop. To address this, we use Spleeter to extract vocals from each song and run NUS AutoLyricsAlign on the extracted vocals to obtain precise word-level alignments of the lyrics. We chose a large enough window so that the actual lyrics have a high probability of being inside the window.
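The linear-alignment heuristic amounts to mapping each audio position to a proportional character offset and taking a fixed-size window around it (a sketch of the idea; the function name and window size are illustrative):

```python
def lyric_window(lyrics, chunk_start, chunk_end, song_duration, window=512):
    """Return a fixed-size character window of `lyrics` for an audio chunk.

    Assumes the lyrics span the song linearly: the character at fractional
    position p corresponds to audio time p * song_duration.
    """
    center_time = (chunk_start + chunk_end) / 2
    center_char = int(len(lyrics) * center_time / song_duration)
    start = max(0, center_char - window // 2)
    return lyrics[start:start + window]
```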
To attend to the lyrics, we add an encoder to produce a representation for the lyrics, and add attention layers that use queries from the music decoder to attend to keys and values from the lyrics encoder. After training, the model learns a more precise alignment.
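That decoder-to-encoder attention is the standard cross-attention pattern, which can be sketched generically (shapes and names here are illustrative, not Jukebox's code):

```python
import numpy as np

def cross_attention(music_h, lyric_h, Wq, Wk, Wv):
    """Music decoder states attend to lyric encoder states.

    music_h: (T_music, D) decoder hidden states -> queries
    lyric_h: (T_lyric, D) encoder hidden states -> keys and values
    """
    Q, K, V = music_h @ Wq, lyric_h @ Wk, lyric_h @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Softmax over lyric positions for each music position.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # (T_music, D)
```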
While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a significant gap between these generations and human-created music.
For example, while the generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, we do not hear familiar larger musical structures such as choruses that repeat. Our downsampling and upsampling process introduces discernible noise; improving the VQ-VAE so its codes capture more musical information would help reduce this. Our models are also slow to sample from because of the autoregressive nature of sampling: it takes approximately 9 hours to fully render one minute of audio through our models, so they cannot yet be used in interactive applications. Techniques that distill the model into a parallel sampler could significantly speed up sampling. Finally, we currently train on English lyrics and mostly Western music, but in the future we hope to include songs from other languages and parts of the world.
Our audio team is continuing to work on generating audio samples conditioned on different kinds of priming information. In particular, we’ve seen early success conditioning on MIDI files and stem files. Here’s an example of a raw audio sample conditioned on MIDI tokens. We hope this will improve the musicality of samples (in the way conditioning on lyrics improved the singing), and this would also be a way of giving musicians more control over the generations. We expect human and model collaborations to be an increasingly exciting creative space. If you’re excited to work on these problems with us, we’re hiring.
As generative modeling across various domains continues to advance, we are also conducting research into issues like bias and intellectual property rights, and are engaging with people who work in the domains where we develop tools. To better understand future implications for the music community, we shared Jukebox with an initial set of 10 musicians from various genres to discuss their feedback on this work. While Jukebox is an interesting research result, these musicians did not find it immediately applicable to their creative process given some of its current limitations. We are connecting with the wider creative community as we think generative work across text, images, and audio will continue to improve. If you’re interested in being a creative collaborator to help us build useful tools or new works of art in these domains, please let us know!
To connect with the corresponding authors, please email [email protected].
Our first raw audio model, which learns to recreate instruments like Piano and Violin. We try a dataset of rock and pop songs, and surprisingly it works.
We collect a larger and more diverse dataset of songs, with labels for genres and artists. The model picks up artist and genre styles more consistently and with greater diversity, and at convergence can also produce full-length songs with long-range coherence.
We scale our VQ-VAE from 22 kHz to 44.1 kHz to achieve higher-quality audio. We also scale the top-level prior from 1B to 5B parameters to capture the increased information. We see better musical quality, clear singing, and long-range coherence. We also make novel completions of real songs.
We start training models conditioned on lyrics to incorporate further conditioning information. We only have unaligned lyrics, so the model has to learn alignment and pronunciation, as well as singing.
Bitcoin Always Online in Venezuela: First Satellite Node Launched in Collaboration With Blockstream
Bitcoiners in Venezuela don’t need the internet to send some satoshis. Today, the crypto payments startup Cryptobuyer announced the successful launch of the country’s first Bitcoin satellite node, thanks to a collaboration between Cryptobuyer, Blockstream, and a team led by crypto enthusiast Aníbal Garrido.
The initiative allows users to interact with the Bitcoin blockchain without an internet connection. A satellite antenna installed in Venezuela handles communication between the node and the rest of the network.
We successfully installed and run a satellite #Bitcoin node in #Venezuela which allows us to be independent of the internet to download messages and validate transactions. Thanks to @Blockstream @adam3us @richardbensberg @anibalcripto for all your support https://t.co/TUb6eG19XP
— Cryptobuyer (@cryptobuyer) September 25, 2020
How the Satellite Node Works
This novel solution allows the Venezuelan node to process information in real time while completely offline, guaranteeing the normal functioning of the network in case of connectivity failures (something widespread in the country). It also facilitates the use of cryptocurrencies in remote places where internet service is scarce, expensive, or even non-existent.
The project works as follows: Blockstream contracts a number of satellites to provide the communication service between the nodes and the blockchain. Cryptobuyer bought the necessary equipment to receive the signal and connect to the satellite, and Anibal Garrido and his team were in charge of assembling the antennas and making the required adjustments.
It’s been a pleasure working with @cryptobuyer and @anibalcripto to launch the first of many #BlockstreamSatellite nodes in #Venezuela, ensuring bitcoiners in the region are always connected to the Bitcoin network! 🛰⛓💻 https://t.co/hzqoR1nACI
— Blockstream (@Blockstream) September 25, 2020
For Alvaro Perez, a software programmer from Valencia City who helped set up the whole infrastructure, the node’s synchronization was an inspiring moment. In statements compiled by Cryptobuyer in an official blog post, he called the operation a “great achievement.”
“We downloaded the whole Bitcoin blockchain and successfully carried out the first transaction through a Bitcoin satellite node in our country on September 23, from the city of Valencia (…) We received bitcoin through the satellite connection without any internet connection. It was a moment of great achievement.”
The journey is just beginning for Bitcoiners in Venezuela
This is the first of three antennas that Cryptobuyer plans to deploy to cover the country’s most critical areas. The remaining two will be placed in Caracas, the capital, in the north of Venezuela, and in Puerto Ordaz, an industrial city in the south of the country.
Later on, they plan to deploy a large number of small devices that will serve as a sort of repeater antenna to create a sizeable mesh-type network that will facilitate transactions in Bitcoin even far away from the primary antenna.
Now there’s no excuse not to start using some satoshis in the country. Venezuela keeps proving that it has plenty of reasons to stand on the podium of the three countries with the most Bitcoin adoption in the world.
KuCoin’s CEO: The $150 Million Hack Is “Small” For KuCoin, Insurance Will Cover
In a dedicated live stream, KuCoin’s CEO noted that although he cannot reveal how much of the company’s total assets were affected during the hack, the stolen amount is “small for KuCoin.” The exchange will cover all the losses with its insurance fund.
- The company first noticed the abnormalities at 2:51 AM, Sept 26, when it received an alert from its internal risk-monitoring system. More alerts followed, indicating abnormal transfers from the hot wallet.
- At 3:01 AM, the exchange received an alert about its remaining balance from the monitoring system. Three minutes later, more alerts came in showing abnormal XRP withdrawals, followed by another alert that the company’s hot wallet was “running out of balance.”
- Subsequent alerts between 3:05 AM and 3:40 AM showed abnormal BTC withdrawal alongside other tokens.
- While the abnormal withdrawals were ongoing, the company set up an urgent task force and shut down its wallet servers. However, the shutdown did little to stop the hackers, as the abnormal transfers continued.
- At this point, KuCoin realized that the private keys of its hot wallet had leaked. The company then started moving the remaining balance in its hot wallet to cold storage at 4:20 AM. The process took about 30 minutes to complete.
- Lyu said the exchange would publish the addresses used by the hackers on its official channels. An earlier report on the hack shows that the Ethereum address supposedly used for the operation contained over $150 million in ETH and ERC-20 tokens.
- KuCoin is now in contact and working with the international police, its largest clients, and industry experts for an in-depth investigation into the incident.
- The CEO also said they had asked most crypto exchanges, including Binance, Bitfinex, OKEx, BitMEX, and Huobi Global, to blacklist the hackers’ wallet addresses and assist with the investigation.
- The crypto community was quick to swing into action on KuCoin’s request. Bitfinex CTO Paolo Ardoino said they had already frozen 13 million USDT on EOS that was part of the hack, while Tether froze 20 million USDT on Ethereum held at the address used in the attack.
- While trading services are still available, withdrawals and deposits will remain closed until the exchange completes its wallet upgrade.
Bullish? On-Exchange Bitcoin Declines While Whales Accumulate (Report)
A recent report suggests that the amount of Bitcoin stored on exchanges is declining while BTC whales increase their holdings and that’s bullish for Bitcoin’s price.
The paper also highlighted that investors have a much larger time horizon for their holdings now compared to previous years.
Bitcoin Stored on Exchanges Drops
In its latest report on Bitcoin investors’ behavior, shared with CryptoPotato, the popular research company Delphi Digital explored the number of bitcoins stored on cryptocurrency exchanges. The document indicated that if the BTC stock on exchanges increases, it could put sell pressure on the price.
However, this isn’t necessarily the case during bull runs, as retail investors often “leave BTC on exchanges and traders use BTC as margin collateral.” Alternatively, in case the asset price rises while the stock on exchange decreases, this typically implies an accumulation trend.
The report indicated that Bitcoin stored on exchanges marked an all-time high of 2.96 million in mid-February. Since then, the trend has reversed, and the number has dropped to below 2.6 million.
Delphi Digital argued that the reason behind this decrease of BTC on exchanges is that investors are most likely preparing for a longer-term holding period. More importantly, though, the paper highlighted a substantial decline in speculative trading interest in Bitcoin, while the HODLing mentality has increased.
“Unlike the 2019 price uptrend, which coincided with BTC stock increasing, this current trend has seen a divergence between BTC stock and price. This suggests a more sustainable move upwards for BTC, in comparison to that of 2019, as data indicates a holder base with longer time horizons.”
Bitcoin Whales Haven’t Slowed Down Accumulating
Delphi Digital’s data reaffirmed previous reports that Bitcoin whales, meaning addresses holding between 1,000 and 10,000 BTC, continue to accumulate large amounts. The company outlined that whales have been on a shopping spree since the start of 2020, with their holdings up 9% year-to-date.
Moreover, the US Federal Reserve’s actions to print extensive amounts of dollars since the start of the COVID-19 pandemic have accelerated whales’ accumulations.
“Since the USD M2 supply expansion in March, there has been a 7% increase in whale holdings.”
According to the document, this only reinforces the narrative that Bitcoin serves as a hedge against dollar inflation, and “the smart money is clearly betting on this.” It’s worth noting that prominent US investor Paul Tudor Jones II purchased BTC earlier this year precisely to protect himself against rising inflation.