STEM skills, spanning science, technology, engineering, and mathematics, are the foundation for solving many real-world problems, such as exploring protein structures, proving mathematical theorems, and discovering new drugs. (Editor's note: STEM is an acronym formed from the first letters of the four disciplines: Science, Technology, Engineering, and Mathematics.)
For artificial intelligence, understanding multimodal information that combines vision and text is key to mastering STEM skills.
However, existing datasets mainly test a model's ability to solve expert-level problems, which reveals little about its grasp of basic knowledge. Moreover, they often consider only textual information while ignoring visual information, or cover only a single STEM discipline.
In addition, because fine-grained information is lacking, researchers in the field cannot properly analyze and address the weaknesses of neural network models.
As a result, content generated by models in this setting can neither be fully trusted nor help guide the direction of future model development. More importantly, without data on human performance, researchers cannot obtain practically meaningful reference points for model performance, which severely hinders the healthy development of artificial intelligence.
To overcome these limitations, a research team from Peking University and Washington University in St. Louis recently built the first multimodal STEM dataset and used it to evaluate large language models and multimodal foundation models.
The results show that even today's most advanced artificial intelligence models still have considerable room for improvement in basic STEM skills and are not yet able to solve harder real-world problems. In other words, current artificial intelligence still lags behind human intelligence.
The paper, titled "Measuring Vision-Language STEM Skills of Neural Models," was recently accepted at the 2024 International Conference on Learning Representations (ICLR 2024) [1].
It is reported that the conference will be held in Vienna, the capital of Austria, from May 7 to May 11 this year.

Here are the resources related to the STEM dataset:
Evaluation link:
Dataset page:
Code GitHub:
Shen Jianhao, a doctoral student at Peking University, and Yuan Ye are co-first authors. Assistant Professor Wang Chenguang of Washington University in St. Louis and Professor Zhang Ming of Peking University are co-corresponding authors. Wang Chenguang received his doctorate from Peking University under the supervision of Professor Zhang Ming.

Building a STEM Dataset for Comprehensively Evaluating the Basic Science and Engineering Capabilities of Neural Network Models
According to Wang Chenguang, once the team had settled on its research objectives and topic, they began collecting data.
The team members, whose strengths lie in algorithm research, ran into some difficulties with tasks such as web crawling, data cleaning, and deduplication. Even so, they took on the challenge, designed a variety of cleaning and deduplication rules, and ultimately built the first multimodal STEM dataset.
It is worth mentioning that the dataset covers 448 STEM skills and contains a total of 1,073,146 questions, making it the most comprehensive and largest multimodal STEM question dataset available to date.
Next, they began to evaluate and analyze the dataset. Because each question is labeled along three dimensions, subject (science, technology, engineering, mathematics), skill, and grade level, the researchers examined each dimension in depth, analyzing the distribution of question counts, question types, question lengths, and other statistics.
They also split each subject's data into a training set, a validation set, and a test set with withheld labels, in a 6:2:2 ratio.
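As a rough illustration of this step, the sketch below performs a per-subject 6:2:2 split. The record fields and function name are illustrative assumptions, not the team's actual data pipeline.

```python
# A rough sketch of a per-subject 6:2:2 split; the "subject" field and the
# function name are illustrative assumptions, not the team's actual pipeline.
import random
from collections import defaultdict

def split_by_subject(questions, seed=0):
    """Split question records into train/valid/test (6:2:2) within each subject."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for q in questions:
        by_subject[q["subject"]].append(q)

    splits = {"train": [], "valid": [], "test": []}
    for subject, items in by_subject.items():
        rng.shuffle(items)
        n = len(items)
        n_train, n_valid = int(0.6 * n), int(0.2 * n)
        splits["train"] += items[:n_train]
        splits["valid"] += items[n_train:n_train + n_valid]
        splits["test"] += items[n_train + n_valid:]  # labels withheld for release
    return splits
```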
Subsequently, researchers designed a model evaluation plan.
In choosing evaluation metrics, in addition to accuracy, they used test scores from one of the most widely recognized online exercise websites () as a key indicator.
The latter is derived from the actual test scores of millions of users on the site and is positively correlated with a student's mastery of the material; a score above 90 (typically at the elementary-school level) indicates that the student has mastered that skill. "We let the model imitate examinees answering questions online and then compared the resulting exam scores with actual human exam results," said Wang Chenguang.
This is a highlight of the work: in the past, comparisons between human performance and artificial intelligence relied on relatively small samples (a few hundred to a few thousand people), whereas the team's results are based on data at the tens-of-millions scale, making them more credible.
For the model-evaluation phase, the researchers chose mainstream foundation models, including OpenAI's multimodal CLIP model and GPT-3.5-Turbo, the model behind ChatGPT.
The former selects an answer based on how well each option matches the image, while the latter first uses a captioning model to generate a description of the image and then lets the language model choose the answer.
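To make the CLIP-based pipeline concrete, here is a minimal sketch of zero-shot answer selection, in which CLIP scores how well each candidate option matches the question image. The checkpoint name, prompt format, and helper function are illustrative assumptions, not necessarily the paper's exact setup.

```python
# Minimal zero-shot CLIP answer selection for a multiple-choice question with
# an accompanying image (illustrative sketch, not the paper's exact code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def answer_with_clip(image_path, question, options):
    """Return the option whose text best matches the image under CLIP."""
    image = Image.open(image_path).convert("RGB")
    # Pair the question with each option so CLIP scores complete answer candidates.
    texts = [f"{question} {opt}" for opt in options]
    inputs = processor(text=texts, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image[0, j]: similarity between the image and candidate j.
    scores = outputs.logits_per_image.squeeze(0)
    return options[int(scores.argmax())]
```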
"We evaluated models of different sizes of CLIP and GPT3.5-Turbo and found that under the zero-sample setting, the error rate of the models is very high. This indicates that existing models cannot directly truly master this knowledge," said Wang Chenguang.Furthermore, they also used the divided training dataset to fine-tune the CLIP model, and found that the fine-tuned model achieved significant performance improvement, with the overall accuracy rate increasing from 54.4% to 76.3%. However, there is still a certain gap from 90 points.
The research group also analyzed the model results from several angles.
Specifically, along the grade dimension, they found that the model's test scores fell as the grade level of the questions rose, consistent with the expectation that questions become harder in higher grades.
Second, looking at performance across different skills, they found that the model performed poorly on abstract knowledge and complex reasoning tasks.
In addition, prior experience suggests that a well-calibrated model should assign high predictive confidence to correct answers. "We found that the model fine-tuned on our dataset was well calibrated, with its confidence clearly correlated with accuracy," said Wang Chenguang.
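A calibration check of this kind can be sketched as follows: group predictions into confidence bins and compare each bin's average confidence with its accuracy; the expected calibration error (ECE) summarizes the gap. The input arrays and bin count below are illustrative.

```python
# Illustrative calibration analysis: confidence vs. accuracy per bin, plus ECE.
import numpy as np

def reliability_table(confidences, correct, n_bins=10):
    """Return (avg_confidence, accuracy, count) for each non-empty confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        upper_ok = confidences <= hi if i == n_bins - 1 else confidences < hi
        mask = (confidences >= lo) & upper_ok
        if mask.any():
            rows.append((confidences[mask].mean(), correct[mask].mean(), int(mask.sum())))
    return rows

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    total = len(confidences)
    return sum(count * abs(conf - acc)
               for conf, acc, count in reliability_table(confidences, correct, n_bins)) / total
```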
In studying the relationship between model size and performance, they likewise found a clear positive correlation.
They also analyzed how model performance relates to factors such as question length, question type, and the number of options, finding that performance declined as questions grew longer, the number of options increased, and the number of examples decreased.
They further assessed the relationship between accuracy and test scores and found a significant positive correlation between the two.
"Ultimately, in terms of overall evaluation metrics, we confirmed that even the fine-tuned model has a significant gap compared to the level of students in the corresponding grade. Based on this, we still need to find more effective methods to enable the model to master STEM knowledge and skills," said Wang Chenguang.Attempt to launch more datasets for evaluating large language models, to accelerate the process of achieving general artificial intelligence.
It is evident that in this research, the STEM dataset played a pivotal role.
Not only does it help models strengthen their foundational STEM knowledge, it also lets researchers assess a model's grasp of basic STEM skills and make targeted improvements through fine-grained data analysis.
Wang Chenguang said he and his team hope the dataset will further advance research on multimodal large models, moving closer to the point where models can fully understand STEM skills and solve real-world STEM problems. They also hope the released test set can serve as one of the standard benchmarks for assessing the capabilities of foundation AI models and gain widespread use in the community.
"More importantly, the comparison we provide with the actual level of large-scale humans (mainly primary school students) can serve as a target and reference for future model development, to accelerate the process of achieving the goal of general artificial intelligence," he said.
So far, based on this dataset, the research group has evaluated the capabilities of neural network models on basic science and engineering education.
Next, on the one hand, they plan to continue collecting data and to release datasets in fields such as the humanities and social sciences, to better evaluate the capabilities of large language models in other key disciplines.
Notably, the team has recently proposed a new social-science dataset called Social, which contains large-scale text-based assessment data and can be used to evaluate the basic social-science capabilities of large language models. The team has also designed a multi-agent interaction method to improve the performance of large language models on the Social dataset.
The related paper, titled "Measuring Social Norms of Large Language Models," has been accepted at the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) [2].
It is reported that the conference will be held in Mexico City, the capital of Mexico, from June 16 to June 21 this year.
On the other hand, they intend to identify gaps in model capabilities by studying performance on fine-grained datasets and to investigate how to close them.
In addition, they hope to further strengthen models' foundational abilities through retrieval-augmented generation (RAG), specialized model architectures, and new training methods. "We believe that only by first achieving breakthroughs in the basic fields of science, engineering, and the liberal arts, and laying a solid foundation, can artificial intelligence be applied more broadly," said Wang Chenguang.