
Biz Builder Mike

You can't sail Today's boat on Yesterday's wind - Michael Noel


AI

Apr 02 2023

A four-legged robotic system for playing soccer on various terrains

If you’ve ever played soccer with a robot, it’s a familiar feeling. Sun glistens down on your face as the smell of grass permeates the air. You look around. A four-legged robot is hustling toward you, dribbling with determination. 

While the bot doesn’t display a Lionel Messi-like level of ability, it’s an impressive in-the-wild dribbling system nonetheless. Researchers from MIT’s Improbable Artificial Intelligence Lab, part of the Computer Science and Artificial Intelligence Laboratory (CSAIL), have developed a legged robotic system that can dribble a soccer ball under the same conditions as humans. The bot used a mixture of onboard sensing and computing to traverse different natural terrains such as sand, gravel, mud, and snow, and adapt to their varied impact on the ball’s motion. Like every committed athlete, “DribbleBot” could get up and recover the ball after falling. 

Programming robots to play soccer has been an active research area for some time. However, the team wanted to automatically learn how to actuate the legs during dribbling, to enable the discovery of hard-to-script skills for responding to diverse terrains like snow, gravel, sand, grass, and pavement. Enter, simulation. 

A robot, ball, and terrain are inside the simulation — a digital twin of the natural world. You can load in the bot and other assets and set physics parameters, and then it handles the forward simulation of the dynamics from there. Four thousand versions of the robot are simulated in parallel in real time, enabling data collection 4,000 times faster than using just one robot. That’s a lot of data. 


Video: MIT CSAIL

The robot starts without knowing how to dribble the ball — it just receives a reward when it does, or negative reinforcement when it messes up. So, it’s essentially trying to figure out what sequence of forces it should apply with its legs. “One aspect of this reinforcement learning approach is that we must design a good reward to facilitate the robot learning a successful dribbling behavior,” says MIT PhD student Gabe Margolis, who co-led the work along with Yandong Ji, research assistant in the Improbable AI Lab. “Once we’ve designed that reward, then it’s practice time for the robot: In real time, it’s a couple of days, and in the simulator, hundreds of days. Over time it learns to get better and better at manipulating the soccer ball to match the desired velocity.” 
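As a rough illustration of what "designing a reward" for velocity matching can look like, here is a minimal sketch. The function name, the exponential shape, and the `sigma` parameter are assumptions made for illustration; the article does not describe the team's actual reward terms.

```python
import math

def ball_velocity_reward(ball_vel, target_vel, sigma=0.25):
    """Hypothetical reward in (0, 1]: highest when the ball moves at the target velocity."""
    # Squared error between the measured and commanded ball velocity (x, y components).
    err_sq = sum((b - t) ** 2 for b, t in zip(ball_vel, target_vel))
    # Exponential shaping: reward decays smoothly as the error grows.
    return math.exp(-err_sq / sigma)

# A ball moving exactly as commanded earns the maximum reward.
perfect = ball_velocity_reward((1.0, 0.0), (1.0, 0.0))
# A stationary ball earns much less when the command is 1 m/s forward.
stalled = ball_velocity_reward((0.0, 0.0), (1.0, 0.0))
print(perfect, stalled)
```

A reinforcement learning algorithm would then adjust the leg controller to increase the sum of such rewards over time.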

The bot could also navigate unfamiliar terrains and recover from falls due to a recovery controller the team built into its system. This controller lets the robot get back up after a fall and switch back to its dribbling controller to continue pursuing the ball, helping it handle out-of-distribution disruptions and terrains. 

“If you look around today, most robots are wheeled. But imagine that there’s a disaster scenario, flooding, or an earthquake, and we want robots to aid humans in the search-and-rescue process. We need the machines to go over terrains that aren’t flat, and wheeled robots can’t traverse those landscapes,” says Pulkit Agrawal, MIT professor, CSAIL principal investigator, and director of Improbable AI Lab. “The whole point of studying legged robots is to go [to] terrains outside the reach of current robotic systems,” he adds. “Our goal in developing algorithms for legged robots is to provide autonomy in challenging and complex terrains that are currently beyond the reach of robotic systems.”

The fascination with robot quadrupeds and soccer runs deep — Canadian professor Alan Mackworth first noted the idea in a paper entitled “On Seeing Robots,” presented at VI-92, 1992. Japanese researchers later organized a workshop on “Grand Challenges in Artificial Intelligence,” which led to discussions about using soccer to promote science and technology. The project was launched as the Robot J-League a year later, and global fervor quickly ensued. Shortly after that, “RoboCup” was born. 

Compared to walking alone, dribbling a soccer ball imposes more constraints on DribbleBot’s motion and what terrains it can traverse. The robot must adapt its locomotion to apply forces to the ball to dribble. On surfaces such as thick grass or pavement, the ball can interact with the landscape differently than the robot does. For example, a soccer ball will experience a drag force on grass that is not present on pavement, and an incline will apply an acceleration force, changing the ball’s typical path. However, the bot’s ability to traverse different terrains is often less affected by these differences in dynamics, as long as it doesn’t slip, so the soccer test can be sensitive to variations in terrain that locomotion alone isn’t.

“Past approaches simplify the dribbling problem, making a modeling assumption of flat, hard ground. The motion is also designed to be more static; the robot isn’t trying to run and manipulate the ball simultaneously,” says Ji. “That’s where more difficult dynamics enter the control problem. We tackled this by extending recent advances that have enabled better outdoor locomotion into this compound task which combines aspects of locomotion and dexterous manipulation together.”

On the hardware side, the robot has a set of sensors that let it perceive the environment, allowing it to feel where it is, “understand” its position, and “see” some of its surroundings. It has a set of actuators that lets it apply forces and move itself and objects. In between the sensors and actuators sits the computer, or “brain,” tasked with converting sensor data into actions, which it will apply through the motors. When the robot is running on snow, it doesn’t see the snow but can feel it through its motor sensors. But soccer is a trickier feat than walking — so the team leveraged cameras on the robot’s head and body for a new sensory modality of vision, in addition to the new motor skill. And then — we dribble. 

“Our robot can go in the wild because it carries all its sensors, cameras, and compute on board. That required some innovations in terms of getting the whole controller to fit onto this onboard compute,” says Margolis. “That’s one area where learning helps because we can run a lightweight neural network and train it to process noisy sensor data observed by the moving robot. This is in stark contrast with most robots today: Typically a robot arm is mounted on a fixed base and sits on a workbench with a giant computer plugged right into it. Neither the computer nor the sensors are in the robotic arm! So, the whole thing is weighty, hard to move around.”

There’s still a long way to go in making these robots as agile as their counterparts in nature, and some terrains were challenging for DribbleBot. Currently, the controller is not trained in simulated environments that include slopes or stairs. The robot isn’t perceiving the geometry of the terrain; it’s only estimating its material contact properties, like friction. If there’s a step up, for example, the robot will get stuck because it can’t lift the ball over the step, an area the team wants to explore in the future. The researchers are also excited to apply lessons learned during development of DribbleBot to other tasks that involve combined locomotion and object manipulation, such as quickly transporting diverse objects from place to place using the legs or arms.

The research is supported by the DARPA Machine Common Sense Program, the MIT-IBM Watson AI Lab, the National Science Foundation Institute of Artificial Intelligence and Fundamental Interactions, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator. The paper will be presented at the 2023 IEEE International Conference on Robotics and Automation (ICRA).

Written by Rachel Gordon MIT CSAIL · Categorized: AI, MIT AI · Tagged: AI, MIT AI

Mar 31 2023

Scaling vision transformers to 22 billion parameters

Posted by Piotr Padlewski and Josip Djolonga, Software Engineers, Google Research

Large Language Models (LLMs) like PaLM or GPT-3 showed that scaling transformers to hundreds of billions of parameters improves performance and unlocks emergent abilities. The biggest dense models for image understanding, however, have reached only 4 billion parameters, despite research indicating that promising multimodal models like PaLI continue to benefit from scaling vision models alongside their language counterparts. Motivated by this, and the results from scaling LLMs, we decided to undertake the next step in the journey of scaling the Vision Transformer.

In “Scaling Vision Transformers to 22 Billion Parameters”, we introduce the biggest dense vision model, ViT-22B. It is 5.5x larger than the previous largest vision backbone, ViT-e, which has 4 billion parameters. To enable this scaling, ViT-22B incorporates ideas from scaling text models like PaLM, with improvements to both training stability (using QK normalization) and training efficiency (with a novel approach called asynchronous parallel linear operations). As a result of its modified architecture, efficient sharding recipe, and bespoke implementation, it could be trained on Cloud TPUs with a high hardware utilization [1]. ViT-22B advances the state of the art on many vision tasks using frozen representations, or with full fine-tuning. Further, the model has also been successfully used in PaLM-E, which showed that a large model combining ViT-22B with a language model can significantly advance the state of the art in robotics tasks.

Architecture

Our work builds on many advances from LLMs, such as PaLM and GPT-3. Compared to the standard Vision Transformer architecture, we use parallel layers, an approach in which attention and MLP blocks are executed in parallel, instead of sequentially as in the standard Transformer. This approach was used in PaLM and reduced training time by 15%.
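The difference between sequential and parallel layers can be sketched on a single token vector. This is a toy illustration under stated assumptions: `toy_attn` and `toy_mlp` are hypothetical stand-ins for the real attention and MLP blocks (real attention mixes information across tokens), and the layer norm has no learnable parameters.

```python
def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def toy_attn(v):
    # Hypothetical stand-in for a self-attention block.
    return [0.5 * t for t in v]

def toy_mlp(v):
    # Hypothetical stand-in for an MLP block.
    return [t * t for t in v]

def sequential_block(x, attn, mlp):
    # Standard Transformer: the MLP waits for the attention output.
    h = [a + b for a, b in zip(x, attn(layer_norm(x)))]
    return [a + b for a, b in zip(h, mlp(layer_norm(h)))]

def parallel_block(x, attn, mlp):
    # Parallel layers: both branches read the same normalized input,
    # so they can be computed at the same time and summed.
    n = layer_norm(x)
    return [a + b + c for a, b, c in zip(x, attn(n), mlp(n))]

x = [1.0, -1.0]
print(sequential_block(x, toy_attn, toy_mlp))
print(parallel_block(x, toy_attn, toy_mlp))
```

The two blocks generally produce different outputs; the parallel form trades that difference for the ability to fuse and overlap the two branches.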

Secondly, ViT-22B omits biases in the QKV projections, part of the self-attention mechanism, and in the LayerNorms, which increases utilization by 3%. The diagram below shows the modified transformer architecture used in ViT-22B:

ViT-22B transformer encoder architecture uses parallel feed-forward layers, omits biases in QKV and LayerNorm layers and normalizes Query and Key projections.

Models at this scale necessitate “sharding” — distributing the model parameters in different compute devices. Alongside this, we also shard the activations (the intermediate representations of an input). Even something as simple as a matrix multiplication necessitates extra care, as both the input and the matrix itself are distributed across devices. We develop an approach called asynchronous parallel linear operations, whereby communications of activations and weights between devices occur at the same time as computations in the matrix multiply unit (the part of the TPU holding the vast majority of the computational capacity). This asynchronous approach minimizes the time waiting on incoming communication, thus increasing device efficiency. The animation below shows an example computation and communication pattern for a matrix multiplication.

Asynchronized parallel linear operation. The goal is to compute the matrix multiplication y = Ax, but both the matrix A and activation x are distributed across different devices. Here we illustrate how it can be done with overlapping communication and computation across devices. The matrix A is column-sharded across the devices, each holding a contiguous slice, each block represented as Aij. More details are in the paper.
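The arithmetic behind a column-sharded product y = Ax can be sketched as follows. This is a sequential simulation under simplifying assumptions: each "device" is just a loop iteration, the number of columns divides evenly, and nothing here models the overlap of communication and computation that the approach is actually about.

```python
def matvec(A, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def column_sharded_matvec(A, x, num_devices):
    """Compute y = Ax by summing per-device partial products.

    Device d holds a contiguous slice of A's columns and the matching
    slice of x. Assumes len(x) is divisible by num_devices.
    """
    shard = len(x) // num_devices
    y = [0.0] * len(A)
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        A_block = [row[lo:hi] for row in A]  # this device's columns of A
        x_block = x[lo:hi]                   # this device's slice of x
        partial = matvec(A_block, x_block)
        # On real hardware this accumulation needs cross-device communication.
        y = [acc + p for acc, p in zip(y, partial)]
    return y

A = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 0, -1, 2]
print(column_sharded_matvec(A, x, num_devices=2))  # [6.0, 14.0]
```

Because each partial product depends only on local data, a device can start computing while other slices are still in transit, which is the opportunity the asynchronous scheme exploits.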

At first, the new model scale resulted in severe training instabilities. The normalization approach of Gilmer et al. (2023, upcoming) resolved these issues, enabling smooth and stable model training; this is illustrated below with example training progressions.

The effect of normalizing the queries and keys (QK normalization) in the self-attention layer on the training dynamics. Without QK normalization (red) gradients become unstable and the training loss diverges.
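A minimal sketch of the idea behind QK normalization, for a single query and key vector: both are layer-normalized before the dot product, which bounds the size of the attention logit. The learnable scale that a real layer norm would carry is omitted, and the function names are illustrative assumptions.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product logit, optionally with normalized query and key."""
    if qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Large activations produce a huge logit without normalization...
q = k = [100.0, -200.0, 300.0, 50.0]
print(attention_logit(q, k, qk_norm=False))
# ...but with normalization the logit magnitude is at most sqrt(len(q)).
print(attention_logit(q, k, qk_norm=True))
```

Very large logits push the softmax toward near one-hot attention weights, which is one plausible route to the unstable gradients described in the caption above.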

Results

Here we highlight some results of ViT-22B. Note that in the paper we also explore several other problem domains, like video classification, depth estimation, and semantic segmentation.

To illustrate the richness of the learned representation, we train a text model to produce representations that align text and image representations (using LiT-tuning). Below we show several results for out-of-distribution images generated by Parti and Imagen:

Examples of image+text understanding for ViT-22B paired with a text model. The graph shows normalized probability distribution for each description of an image.

Human object recognition alignment

To find out how aligned ViT-22B classification decisions are with human classification decisions, we evaluated ViT-22B fine-tuned with different resolutions on out-of-distribution (OOD) datasets for which human comparison data is available via the model-vs-human toolbox. This toolbox measures three key metrics: How well do models cope with distortions (accuracy)? How different are human and model accuracies (accuracy difference)? Finally, how similar are human and model error patterns (error consistency)? While not all fine-tuning resolutions perform equally well, ViT-22B variants are state of the art for all three metrics. Furthermore, the ViT-22B models also have the highest ever recorded shape bias in vision models. This means that they mostly use object shape, rather than object texture, to inform classification decisions, a strategy known from human perception (which has a shape bias of 96%). Standard models (e.g., ResNet-50, which has a ~20–30% shape bias) often classify images like the cat with elephant texture below according to the texture (elephant); models with a high shape bias tend to focus on the shape instead (cat). While there are still many important differences between human and model perception, ViT-22B shows increased similarities to human visual object recognition.

Cat or elephant? Car or clock? Bird or bicycle? Example images with the shape of one object and the texture of a different object, used to measure shape/texture bias.

Shape bias evaluation (higher = more shape-biased). Many vision models have a low shape / high texture bias, whereas ViT-22B models fine-tuned on ImageNet (red, green, blue; trained on 4B images as indicated by brackets after model names, unless trained on ImageNet only) have the highest shape bias recorded in an ML model to date, bringing them closer to a human-like shape bias.
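Shape bias on such cue-conflict images is commonly computed as the share of shape decisions among all decisions that match either the shape or the texture label. The sketch below follows that common definition; the function name and the toy labels are illustrative, not taken from the toolbox.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape decisions among predictions matching shape or texture."""
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
        # Predictions matching neither cue are ignored.
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Toy example: three cat-shaped images with elephant texture, one car with clock texture.
preds = ["cat", "elephant", "cat", "dog"]
shapes = ["cat", "cat", "cat", "car"]
textures = ["elephant", "elephant", "elephant", "clock"]
print(shape_bias(preds, shapes, textures))
```

Here two predictions follow the shape and one follows the texture, so the shape bias is two thirds; the "dog" prediction matches neither cue and does not count.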

Out-of-distribution performance

Measuring performance on OOD datasets helps assess generalization. In this experiment we construct label-maps (mappings of labels between datasets) from JFT to ImageNet and also from ImageNet to different out-of-distribution datasets like ObjectNet (results after pre-training on this data shown in the left curve below). Then the models are fully fine-tuned on ImageNet.

We observe that scaling Vision Transformers increases OOD performance: even though ImageNet accuracy saturates, we see a significant increase on ObjectNet from ViT-e to ViT-22B (shown by the three orange dots in the upper right below).

Even though ImageNet accuracy saturates, we see a significant increase in performance on ObjectNet from ViT-e/14 to ViT-22B.

Linear probe

Linear probe is a technique where a single linear layer is trained on top of a frozen model. Compared to full fine-tuning, this is much cheaper to train and easier to set up. We observed that the performance of a linear probe on ViT-22B approaches that of state-of-the-art full fine-tuning of smaller models using high-resolution images (training with higher resolution is generally much more expensive, but for many tasks it yields better results). Here are results of a linear probe trained on the ImageNet dataset and evaluated on the ImageNet validation dataset and other OOD ImageNet datasets.
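To make the linear probe idea concrete, here is a minimal sketch for a binary task. The "frozen features" are made-up two-dimensional numbers standing in for embeddings from a frozen backbone, and the logistic-regression training loop is an assumption for illustration; only the linear layer's weights are ever updated.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit one linear layer (weights w, bias b) on fixed features with SGD."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid probability of class 1
            g = p - y                       # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Hypothetical frozen embeddings; the backbone that produced them is never touched.
frozen_features = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, -0.1]]
labels = [1, 1, 0, 0]
w, b = train_linear_probe(frozen_features, labels)
print([predict(w, b, x) for x in frozen_features])
```

Because only a handful of parameters are trained, the cost is tiny compared with updating every weight of the backbone.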

Distillation

The knowledge of the bigger model can be transferred to a smaller model using the distillation method. This is helpful as big models are slower and more expensive to use. We found that ViT-22B knowledge can be transferred to smaller models like ViT-B/16 and ViT-L/16, achieving a new state of the art on ImageNet for those model sizes.
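One common way to set up distillation is to train the student to match the teacher's softened output distribution. The sketch below shows that loss for a single example; the temperature value and function names are assumptions, and the article does not state which distillation recipe was used for ViT-22B.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]
print(distillation_loss([4.0, 1.0, 0.5], teacher))  # identical outputs give zero loss
print(distillation_loss([0.5, 1.0, 4.0], teacher))  # disagreement gives a positive loss
```

Minimizing this loss over a dataset pushes the smaller model toward the larger model's behavior, including how it ranks the incorrect classes.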

Fairness and bias

ML models can be susceptible to unintended unfair biases, such as picking up spurious correlations (measured using demographic parity) or having performance gaps across subgroups. We show that scaling up the size helps in mitigating such issues.

First, scale offers a more favorable tradeoff frontier — performance improves with scale even when the model is post-processed after training to control its level of demographic parity below a prescribed, tolerable level. Importantly, this holds not only when performance is measured in terms of accuracy, but also other metrics, such as calibration, which is a statistical measure of the truthfulness of the model’s estimated probabilities. Second, classification of all subgroups tends to improve with scale as demonstrated below. Third, ViT-22B reduces the performance gap across subgroups.
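Demographic parity is usually quantified as the difference in positive-prediction rates between groups. The following sketch computes that gap for binary predictions; the function name and the toy data are illustrative assumptions.

```python
def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = {}
    for g in set(groups):
        members = [p for p, gi in zip(predictions, groups) if gi == g]
        rates[g] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values())

# Toy binary predictions for two groups of four examples each.
preds = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_gap(preds, groups))  # 0.5
```

A post-processing step of the kind mentioned above would adjust decisions until this gap falls below the prescribed level, and the question is then how much accuracy remains.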

Top: Accuracy for each subgroup in CelebA before debiasing. Bottom: The y-axis shows the absolute difference in performance across the two specific subgroups highlighted in this example: females and males. ViT-22B has a small gap in performance compared to smaller ViT architectures.

Conclusions

We have presented ViT-22B, currently the largest vision transformer model at 22 billion parameters. With small but critical changes to the original architecture, we achieved excellent hardware utilization and training stability, yielding a model that advances the state of the art on several benchmarks. Great performance can be achieved using the frozen model to produce embeddings and then training thin layers on top. Our evaluations further show that ViT-22B shows increased similarities to human visual perception when it comes to shape and texture bias, and offers benefits in fairness and robustness, when compared to existing models.

Acknowledgements

This is a joint work of Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, and Neil Houlsby.

We would like to thank Jasper Uijlings, Jeremy Cohen, Arushi Goel, Radu Soricut, Xingyi Zhou, Lluis Castrejon, Adam Paszke, Joelle Barral, Federico Lebron, Blake Hechtman, and Peter Hawkins. Their expertise and unwavering support played a crucial role in the completion of this paper. We also acknowledge the collaboration and dedication of the talented researchers and engineers at Google Research.


[1] Note: ViT-22B has 54.9% model FLOPs utilization (MFU), while PaLM reported 46.2% MFU and we measured 44.0% MFU for ViT-e on the same hardware.

Written by Google AI · Categorized: AI · Tagged: AI

Mar 31 2023

Speeding up drug discovery with diffusion generative models

With the release of platforms like DALL-E 2 and Midjourney, diffusion generative models have achieved mainstream popularity, owing to their ability to generate a series of absurd, breathtaking, and often meme-worthy images from text prompts like “teddy bears working on new AI research on the moon in the 1980s.” But a team of researchers at MIT’s Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic) thinks there could be more to diffusion generative models than just creating surreal images — they could accelerate the development of new drugs and reduce the likelihood of adverse side effects.

A paper introducing this new molecular docking model, called DiffDock, will be presented at the 11th International Conference on Learning Representations. The model’s unique approach to computational drug design is a paradigm shift from current state-of-the-art tools that most pharmaceutical companies use, presenting a major opportunity for an overhaul of the traditional drug development pipeline.

Drugs typically function by interacting with the proteins that make up our bodies, or proteins of bacteria and viruses. Molecular docking was developed to gain insight into these interactions by predicting the atomic 3D coordinates with which a ligand (i.e., drug molecule) and protein could bind together. 

Molecular docking has led to the successful identification of drugs that now treat HIV and cancer. But with each drug averaging a decade of development time and 90 percent of drug candidates failing costly clinical trials (most studies estimate average drug development costs at around $1 billion to over $2 billion per drug), it’s no wonder that researchers are looking for faster, more efficient ways to sift through potential drug molecules.

Currently, most molecular docking tools used for in-silico drug design take a “sampling and scoring” approach, searching for a ligand “pose” that best fits the protein pocket. This time-consuming process evaluates a large number of different poses, then scores them based on how well the ligand binds to the protein.

In previous deep-learning solutions, molecular docking is treated as a regression problem. In other words, “it assumes that you have a single target that you’re trying to optimize for and there’s a single right answer,” says Gabriele Corso, co-author and second-year MIT PhD student in electrical engineering and computer science who is an affiliate of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). “With generative modeling, you assume that there is a distribution of possible answers; this is critical in the presence of uncertainty.”

“Instead of a single prediction as previously, you now allow multiple poses to be predicted, and each one with a different probability,” adds Hannes Stärk, co-author and first-year MIT PhD student in electrical engineering and computer science who is an affiliate of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). As a result, the model doesn’t need to compromise in attempting to arrive at a single conclusion, which can be a recipe for failure.

To understand how diffusion generative models work, it is helpful to explain them based on image-generating diffusion models. Here, diffusion models gradually add random noise to a 2D image through a series of steps, destroying the data in the image until it becomes nothing but grainy static. A neural network is then trained to recover the original image by reversing this noising process. The model can then generate new data by starting from a random configuration and iteratively removing the noise.
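The noising half of this process can be sketched in a few lines. This follows one widely used formulation (a linear noise schedule with a closed-form jump to step t); the schedule values and function names are assumptions for illustration and are not taken from DiffDock, which diffuses over ligand poses rather than pixels.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta  # fraction of the original signal still present
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample the noised data at step t directly from the clean data x0."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

alpha_bars = make_alpha_bars(1000)
rng = random.Random(0)
pixels = [0.2, 0.8, 0.5, 0.1]  # a tiny stand-in for an image
slightly_noisy = add_noise(pixels, 10, alpha_bars, rng)
almost_static = add_noise(pixels, 999, alpha_bars, rng)
print(alpha_bars[0], alpha_bars[-1])
```

By the last step almost none of the original signal remains, which is why generation can start from pure noise; the trained network's job is to undo these steps one at a time.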

In the case of DiffDock, after being trained on a variety of ligand and protein poses, the model is able to successfully identify multiple binding sites on proteins that it has never encountered before. Instead of generating new image data, it generates new 3D coordinates that help the ligand find potential angles that would allow it to fit into the protein pocket.

This “blind docking” approach creates new opportunities to take advantage of AlphaFold 2 (2020), DeepMind’s famous protein folding AI model. Since AlphaFold 1’s initial release in 2018, there has been a great deal of excitement in the research community over the potential of AlphaFold’s computationally folded protein structures to help identify new drug mechanisms of action. But state-of-the-art molecular docking tools have yet to demonstrate that their performance in binding ligands to computationally predicted structures is any better than random chance.

DiffDock is significantly more accurate than previous approaches on traditional docking benchmarks, thanks to its ability to reason at a higher scale and implicitly model some of the protein flexibility. It also maintains high performance even as other docking models begin to fail. In the more realistic scenario involving computationally generated unbound protein structures, DiffDock places 22 percent of its predictions within 2 angstroms (widely considered the threshold for an accurate pose; 1 Å is one ten-billionth of a meter). That is more than double the rate of other docking models, some of which barely hover over 10 percent while others drop as low as 1.7 percent.
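The "within 2 angstroms" criterion is typically checked with the root-mean-square deviation (RMSD) between predicted and reference atom positions. The sketch below is a simplified version under stated assumptions: atoms are already matched one to one, and refinements used in real benchmarks (such as handling molecular symmetry) are left out.

```python
import math

def rmsd(pred, ref):
    """Root-mean-square deviation between matched 3D atom coordinates (in Å)."""
    sq = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(sq / len(ref))

def success_rate(pred_poses, ref_poses, threshold=2.0):
    """Fraction of predicted poses whose RMSD to the reference is below threshold."""
    hits = sum(rmsd(p, r) < threshold for p, r in zip(pred_poses, ref_poses))
    return hits / len(ref_poses)

reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
close_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # every atom shifted by 1 Å
far_pose = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]    # every atom shifted by 3 Å
print(success_rate([close_pose, far_pose], [reference, reference]))  # 0.5
```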

These improvements create a new landscape of opportunities for biological research and drug discovery. For instance, many drugs are found via a process known as phenotypic screening, in which researchers observe the effects of a given drug on a disease without knowing which proteins the drug is acting upon. Discovering the mechanism of action of the drug is then critical to understanding how the drug can be improved and its potential side effects. This process, known as “reverse screening,” can be extremely challenging and costly, but a combination of protein folding techniques and DiffDock may allow performing a large part of the process in silico, allowing potential “off-target” side effects to be identified early on before clinical trials take place.

“DiffDock makes drug target identification much more possible. Before, one had to do laborious and costly experiments (months to years) with each protein to define the drug docking. But now, one can screen many proteins and do the triaging virtually in a day,” says Tim Peterson, an assistant professor at the Washington University in St. Louis School of Medicine. Peterson used DiffDock to characterize the mechanism of action of a novel drug candidate treating aging-related diseases in a recent paper. “There is a very ‘fate loves irony’ aspect that Eroom’s law (that drug discovery takes longer and costs more money each year) is being solved by its namesake Moore’s law (that computers get faster and cheaper each year) using tools such as DiffDock.”

This work was conducted by MIT PhD students Gabriele Corso, Hannes Stärk, and Bowen Jing, and their advisors, Professor Regina Barzilay and Professor Tommi Jaakkola, and was supported by the Machine Learning for Pharmaceutical Discovery and Synthesis consortium, the Jameel Clinic, the DTRA Discovery of Medical Countermeasures Against New and Emerging Threats program, the DARPA Accelerated Molecular Discovery program, the Sanofi Computational Antibody Design grant, and a Department of Energy Computational Science Graduate Fellowship.

Written by Alex Ouyang Abdul Latif Jameel Clinic for Machine Learning i · Categorized: AI, MIT AI · Tagged: AI, MIT AI

Mar 30 2023

A method for designing neural networks optimally suited for certain tasks

Neural networks, a type of machine-learning model, are being used to help humans complete a wide variety of tasks, from predicting if someone’s credit score is high enough to qualify for a loan to diagnosing whether a patient has a certain disease. But researchers still have only a limited understanding of how these models work. Whether a given model is optimal for a certain task remains an open question.

MIT researchers have found some answers. They conducted an analysis of neural networks and proved that they can be designed so they are “optimal,” meaning they minimize the probability of misclassifying borrowers or patients into the wrong category when the networks are given a lot of labeled training data. To achieve optimality, these networks must be built with a specific architecture.

The researchers discovered that, in certain situations, the building blocks that enable a neural network to be optimal are not the ones developers use in practice. These optimal building blocks, derived through the new analysis, are unconventional and haven’t been considered before, the researchers say.

In a paper published this week in the Proceedings of the National Academy of Sciences, they describe these optimal building blocks, called activation functions, and show how they can be used to design neural networks that achieve better performance on any dataset. The results hold even as the neural networks grow very large. This work could help developers select the correct activation function, enabling them to build neural networks that classify data more accurately in a wide range of application areas, explains senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS).

“While these are new activation functions that have never been used before, they are simple functions that someone could actually implement for a particular problem. This work really shows the importance of having theoretical proofs. If you go after a principled understanding of these models, that can actually lead you to new activation functions that you would otherwise never have thought of,” says Uhler, who is also co-director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS) and its Institute for Data, Systems and Society (IDSS).

Joining Uhler on the paper are lead author Adityanarayanan Radhakrishnan, an EECS graduate student and an Eric and Wendy Schmidt Center Fellow, and Mikhail Belkin, a professor in the Halicioğlu Data Science Institute at the University of California at San Diego.

Activation investigation

A neural network is a type of machine-learning model that is loosely based on the human brain. Many layers of interconnected nodes, or neurons, process data. Researchers train a network to complete a task by showing it millions of examples from a dataset.

For instance, a network that has been trained to classify images into categories, say dogs and cats, is given an image that has been encoded as numbers. The network performs a series of complex multiplication operations, layer by layer, until the result is just one number. If that number is positive, the network classifies the image as a dog, and if it is negative, as a cat.

Activation functions help the network learn complex patterns in the input data. They do this by applying a transformation to the output of one layer before data are sent to the next layer. When researchers build a neural network, they select one activation function to use. They also choose the width of the network (how many neurons are in each layer) and the depth (how many layers are in the network).
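The layer-by-layer computation and the role of the activation function can be sketched with a tiny hand-built network. The weights below are arbitrary illustrative numbers, and ReLU is used only as a familiar example of a standard activation; it is not one of the new functions from the paper.

```python
def relu(v):
    """A standard activation: keep positive values, zero out negative ones."""
    return [max(0.0, t) for t in v]

def dense(W, b, x):
    """One layer's multiplication step: W times x, plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def classify(x, layers, activation=relu):
    """Run x through the layers and classify by the sign of the final number."""
    h = x
    for W, b in layers[:-1]:
        h = activation(dense(W, b, h))  # transform before passing to the next layer
    final_W, final_b = layers[-1]
    (score,) = dense(final_W, final_b, h)
    return "dog" if score > 0 else "cat"

# Width 2, depth 2: one hidden layer followed by a single output number.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
print(classify([2.0, 0.5], layers))
print(classify([0.5, 2.0], layers))
```

Swapping `relu` for a different function changes how each layer's output is transformed, which is the design choice the researchers analyze.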

“It turns out that, if you take the standard activation functions that people use in practice, and keep increasing the depth of the network, it gives you really terrible performance. We show that if you design with different activation functions, as you get more data, your network will get better and better,” says Radhakrishnan.

He and his collaborators studied a situation in which a neural network is infinitely deep and wide — which means the network is built by continually adding more layers and more nodes — and is trained to perform classification tasks. In classification, the network learns to place data inputs into separate categories.

“A clean picture”

After conducting a detailed analysis, the researchers determined that there are only three ways this kind of network can learn to classify inputs. One method classifies an input based on the majority of inputs in the training data; if there are more dogs than cats, it will decide every new input is a dog. Another method classifies by choosing the label (dog or cat) of the training data point that most resembles the new input.

The third method classifies a new input based on a weighted average of all the training data points that are similar to it. Their analysis shows that this is the only method of the three that leads to optimal performance. They identified a set of activation functions that always use this optimal classification method.
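The three behaviors described above can be sketched as three small functions. This is only an illustration: the one-dimensional data, the +1 (dog) and -1 (cat) labels, and the Gaussian weighting with a `bandwidth` parameter are assumptions, not the formulas from the paper.

```python
import math

def majority_vote(train, x):
    """Ignore the input and return the most common training label."""
    total = sum(label for _, label in train)
    return 1 if total > 0 else -1

def nearest_neighbor(train, x):
    """Copy the label of the single training point closest to x."""
    _, label = min(train, key=lambda pair: math.dist(pair[0], x))
    return label

def weighted_average(train, x, bandwidth=1.0):
    """Average all labels, giving nearby points more weight (Gaussian weights assumed)."""
    weights = [math.exp(-math.dist(p, x) ** 2 / (2 * bandwidth ** 2)) for p, _ in train]
    average = sum(w * label for w, (_, label) in zip(weights, train)) / sum(weights)
    return 1 if average > 0 else -1

# Hypothetical training set: +1 = dog, -1 = cat. The cat at 0.6 sits inside the dog cluster.
train = [((0.0,), 1), ((0.4,), 1), ((1.0,), 1),
         ((0.6,), -1), ((3.0,), -1), ((3.2,), -1), ((3.4,), -1)]

query = (0.55,)
print(majority_vote(train, query), nearest_neighbor(train, query), weighted_average(train, query))
```

On this toy data the majority rule answers cat for every query, the nearest-neighbor rule follows the lone cat at 0.6, and the weighted average sides with the surrounding dogs.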

“That was one of the most surprising things — no matter what you choose for an activation function, it is just going to be one of these three classifiers. We have formulas that will tell you explicitly which of these three it is going to be. It is a very clean picture,” he says.

They tested this theory on several classification benchmarking tasks and found that it led to improved performance in many cases. Neural network builders could use their formulas to select an activation function that yields improved classification performance, Radhakrishnan says.

In the future, the researchers want to use what they’ve learned to analyze situations where data are limited and networks are not infinitely wide or deep. They also want to apply this analysis to situations where data do not have labels.

“In deep learning, we want to build theoretically grounded models so we can reliably deploy them in some mission-critical setting. This is a promising approach at getting toward something like that — building architectures in a theoretically grounded way that translates into better results in practice,” he says.

This work was supported, in part, by the National Science Foundation, Office of Naval Research, the MIT-IBM Watson AI Lab, the Eric and Wendy Schmidt Center at the Broad Institute, and a Simons Investigator Award.

Written by Adam Zewe, MIT News Office · Categorized: AI, MIT AI · Tagged: AI, MIT AI

Mar 30 2023

Data-centric ML benchmarking: Announcing DataPerf’s 2023 challenges

Posted by Peter Mattson, Senior Staff Engineer, ML Performance, and Praveen Paritosh, Senior Research Scientist, Google Research, Brain Team

Machine learning (ML) offers tremendous potential, from diagnosing cancer to engineering safe self-driving cars to amplifying human productivity. To realize this potential, however, organizations need ML solutions to be reliable with ML solution development that is predictable and tractable. The key to both is a deeper understanding of ML data — how to engineer training datasets that produce high quality models and test datasets that deliver accurate indicators of how close we are to solving the target problem.

The process of creating high quality datasets is complicated and error-prone, from the initial selection and cleaning of raw data, to labeling the data and splitting it into training and test sets. Some experts believe that the majority of the effort in designing an ML system is actually the sourcing and preparing of data. Each step can introduce issues and biases. Even many of the standard datasets we use today have been shown to have mislabeled data that can destabilize established ML benchmarks. Despite the fundamental importance of data to ML, it’s only now beginning to receive the same level of attention that models and learning algorithms have been enjoying for the past decade.

Towards this goal, we are introducing DataPerf, a set of new data-centric ML challenges to advance the state-of-the-art in data selection, preparation, and acquisition technologies, designed and built through a broad collaboration across industry and academia. The initial version of DataPerf consists of four challenges focused on three common data-centric tasks across three application domains: vision, speech and natural language processing (NLP). In this blog post, we outline dataset development bottlenecks confronting researchers and discuss the role of benchmarks and leaderboards in incentivizing researchers to address these challenges. We invite innovators in academia and industry who seek to measure and validate breakthroughs in data-centric ML to demonstrate the power of their algorithms and techniques to create and improve datasets through these benchmarks.

Data is the new bottleneck for ML

Data is the new code: it is the training data that determines the maximum possible quality of an ML solution. The model only determines the degree to which that maximum quality is realized; in a sense the model is a lossy compiler for the data. Though high-quality training datasets are vital to continued advancement in the field of ML, much of the data on which the field relies today is nearly a decade old (e.g., ImageNet or LibriSpeech) or scraped from the web with very limited filtering of content (e.g., LAION or The Pile).

Despite the importance of data, ML research to date has been dominated by a focus on models. Before modern deep neural networks (DNNs), there were no ML models sufficient to match human behavior for many simple tasks. This starting condition led to a model-centric paradigm in which (1) the training dataset and test dataset were “frozen” artifacts and the goal was to develop a better model, and (2) the test dataset was selected randomly from the same pool of data as the training set for statistical reasons. Unfortunately, freezing the datasets ignored the ability to improve training accuracy and efficiency with better data, and using test sets drawn from the same pool as training data conflated fitting that data well with actually solving the underlying problem.

Because we are now developing and deploying ML solutions for increasingly sophisticated tasks, we need to engineer test sets that fully capture real world problems and training sets that, in combination with advanced models, deliver effective solutions. We need to shift from today’s model-centric paradigm to a data-centric paradigm in which we recognize that for the majority of ML developers, creating high quality training and test data will be a bottleneck.

Shifting from today’s model-centric paradigm to a data-centric paradigm enabled by quality datasets and data-centric algorithms like those measured in DataPerf.

Enabling ML developers to create better training and test datasets will require a deeper understanding of ML data quality and the development of algorithms, tools, and methodologies for optimizing it. We can begin by recognizing common challenges in dataset creation and developing performance metrics for algorithms that address those challenges. For instance:

  • Data selection: Often, we have a larger pool of available data than we can label or train on effectively. How do we choose the most important data for training our models?
  • Data cleaning: Human labelers sometimes make mistakes. ML developers can’t afford to have experts check and correct all labels. How can we select the most likely-to-be-mislabeled data for correction?
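As a rough illustration of the data cleaning question above, one simple heuristic is to flag examples whose labels disagree with those of their nearest neighbors. The sketch below is not a DataPerf baseline; the function name `rank_suspect_labels`, the toy points, and the choice of `k` are all hypothetical.

```python
import math

def rank_suspect_labels(points, labels, k=3):
    """Return example indices ordered from most to least suspicious.

    Suspicion is the fraction of an example's k nearest neighbors
    that carry a different label.
    """
    scores = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        neighbors = sorted(others, key=lambda j: math.dist(points[j], p))[:k]
        disagreement = sum(labels[j] != labels[i] for j in neighbors) / k
        scores.append((disagreement, i))
    # Stable sort: ties keep their original index order.
    return [i for _, i in sorted(scores, key=lambda s: -s[0])]

# Hypothetical features: two tight clusters, with one "cat" label inside the dog cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = ["dog", "dog", "dog", "cat", "cat", "cat", "cat", "cat"]

ranked = rank_suspect_labels(points, labels)
print(ranked[0])  # index of the example to send to an expert first
```

With a fixed relabeling budget, an expert would review only the first few indices in `ranked` rather than every label.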

We can also create incentives that reward good dataset engineering. We anticipate that high quality training data, which has been carefully selected and labeled, will become a valuable product in many industries, but we presently lack a way to assess the relative value of different datasets without actually training on the datasets in question. How do we solve this problem and enable quality-driven “data acquisition”?

DataPerf: The first leaderboard for data

We believe good benchmarks and leaderboards can drive rapid progress in data-centric technology. ML benchmarks in academia have been essential to stimulating progress in the field. Consider the following graph which shows progress on popular ML benchmarks (MNIST, ImageNet, SQuAD, GLUE, Switchboard) over time:

Performance over time for popular benchmarks, normalized with initial performance at minus one and human performance at zero. (Source: Kiela et al. 2021; used with permission.)
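One way to read that normalization, assuming a simple linear rescaling (the exact procedure behind the graph is not given here), is sketched below. The benchmark numbers are invented for the example.

```python
def normalize(score, initial, human):
    """Linearly rescale so that `initial` maps to -1 and `human` maps to 0."""
    return (score - human) / (human - initial)

# Hypothetical benchmark: the first model scored 60, humans score 90.
initial_score, human_score = 60.0, 90.0

print(normalize(60.0, initial_score, human_score))  # the starting point
print(normalize(75.0, initial_score, human_score))  # halfway to human level
print(normalize(93.0, initial_score, human_score))  # above human level, so positive
```

On this scale, a curve crossing zero means a model has matched human performance on that benchmark.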

Online leaderboards provide official validation of benchmark results and catalyze communities intent on optimizing those benchmarks. For instance, Kaggle has over 10 million registered users. The MLPerf official benchmark results have helped drive an over 16x improvement in training performance on key benchmarks.

DataPerf is the first community and platform to build leaderboards for data benchmarks, and we hope to have an analogous impact on research and development for data-centric ML. The initial version of DataPerf consists of leaderboards for four challenges focused on three data-centric tasks (data selection, cleaning, and acquisition) across three application domains (vision, speech and NLP):

  • Training data selection (Vision): Design a data selection strategy that chooses the best training set from a large candidate pool of weakly labeled training images.
  • Training data selection (Speech): Design a data selection strategy that chooses the best training set from a large candidate pool of automatically extracted clips of spoken words.
  • Training data cleaning (Vision): Design a data cleaning strategy that chooses samples to relabel from a “noisy” training set where some of the labels are incorrect.
  • Training dataset evaluation (NLP): Quality datasets can be expensive to construct, and are becoming valuable commodities. Design a data acquisition strategy that chooses which training dataset to “buy” based on limited information about the data.
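To give a feel for what a data selection strategy might look like, here is a minimal sketch of one generic heuristic: greedy farthest-point selection, which picks a diverse subset under a fixed budget. It is not one of the DataPerf baselines, and the function name `select_diverse` and the toy candidate pool are hypothetical.

```python
import math

def select_diverse(pool, budget):
    """Greedy farthest-point selection.

    Start from the first candidate, then repeatedly add the candidate
    that is farthest from everything already chosen.
    """
    if budget <= 0 or not pool:
        return []
    chosen = [0]
    while len(chosen) < min(budget, len(pool)):
        remaining = [i for i in range(len(pool)) if i not in chosen]
        best = max(remaining,
                   key=lambda i: min(math.dist(pool[i], pool[c]) for c in chosen))
        chosen.append(best)
    return chosen

# Hypothetical 2-D features for six candidate training examples.
pool = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]

selected = select_diverse(pool, budget=3)
print(selected)
```

The near-duplicate points around the origin are skipped in favor of examples that cover other regions of the pool. A real submission would be scored by training the challenge's test model on the selected set.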

For each challenge, the DataPerf website provides design documents that define the problem, test model(s), quality target, rules and guidelines on how to run the code and submit. The live leaderboards are hosted on the Dynabench platform, which also provides an online evaluation framework and submission tracker. Dynabench is an open-source project, hosted by the MLCommons Association, focused on enabling data-centric leaderboards for both training and test data and data-centric algorithms.

How to get involved

We are part of a community of ML researchers, data scientists and engineers who strive to improve data quality. We invite innovators in academia and industry to measure and validate data-centric algorithms and techniques to create and improve datasets through the DataPerf benchmarks. The deadline for the first round of challenges is May 26th, 2023.

Acknowledgements

The DataPerf benchmarks were created over the last year by engineers and scientists from Coactive.ai, Eidgenössische Technische Hochschule (ETH) Zurich, Google, Harvard University, Meta, MLCommons, and Stanford University. In addition, this would not have been possible without the support of DataPerf working group members from Carnegie Mellon University, Digital Prism Advisors, Factored, Hugging Face, Institute for Human and Machine Cognition, Landing.ai, San Diego Supercomputing Center, Thomson Reuters Lab, and TU Eindhoven.

Written by Google AI · Categorized: AI · Tagged: AI
