Nowadays, AI technology is a leading trend. That’s no surprise: language models such as GPT-4 and Claude exhibit remarkable capabilities, but their underlying training data has been shrouded in secrecy. The Allen Institute for AI (AI2) seeks to disrupt this pattern by introducing an expansive text dataset that is both freely accessible and open to scrutiny. Today, let’s talk about this new way to build advanced language models, how it can influence AI regulation, and the dangers it presents.
AI2 releases its largest open dataset to date, designed for training advanced language models.
Named Dolma (short for "Data to feed OLMo's Appetite"), this dataset serves as the cornerstone for the research team's upcoming open language model, OLMo. The rationale behind this initiative, as AI2 researchers contend, is that just as the model is intended to be freely accessible and customizable for the AI research community, so should be the dataset employed in its creation.
In the chart crafted by AI2, it's evident that the largest and most recent models merely offer a fraction of the details that a researcher would probably seek about a specific dataset. What specific information has been omitted, and what were the reasons for these omissions? How was the distinction made between high-quality and low-quality text? Were any personal particulars effectively removed?
While AI2's is not the first attempt at an open dataset, it does stand out as the most extensive, encompassing roughly 3 trillion tokens (an AI-specific measure of content volume). The project also positions itself as the most straightforward in terms of use and permissions. Dolma is released under AI2's "ImpACT license for medium-risk artifacts," which you can explore in detail; the process for prospective Dolma users involves:
1. Providing contact information and specifying intended use cases.
2. Disclosing any derivative creations stemming from Dolma.
3. Distributing these derivatives under the same licensing terms.
4. Committing not to apply Dolma to prohibited domains like surveillance or disinformation.
If this aligns with your requirements, access to Dolma can be attained via Hugging Face.
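For readers who want to poke at the data once access is granted, here is a minimal sketch of how it might be pulled from Hugging Face. The dataset id "allenai/dolma" and the per-record "text" field are assumptions about the published layout; check the dataset card before relying on either. The toy token counter is a deliberate simplification, not the tokenizer AI2 actually uses.

```python
# Hedged sketch: sampling Dolma from the Hugging Face Hub.
# Assumptions (verify against the dataset card): dataset id
# "allenai/dolma"; each record carries a "text" field.
try:
    from datasets import load_dataset  # pip install datasets
except ImportError:
    load_dataset = None


def count_whitespace_tokens(docs):
    """Toy token count via whitespace splitting -- a rough stand-in
    for the subword tokenizers language models actually use."""
    return sum(len(d["text"].split()) for d in docs)


def sample_dolma(n=3):
    """Fetch the first n documents without downloading the full corpus."""
    if load_dataset is None:
        raise RuntimeError("install the Hugging Face `datasets` library first")
    # streaming=True is essential: at trillions of tokens, the corpus
    # is far too large to download up front.
    stream = load_dataset("allenai/dolma", split="train", streaming=True)
    it = iter(stream)
    return [next(it) for _ in range(n)]
```

Streaming mode is the key design choice here: it lets you inspect individual documents lazily instead of committing terabytes of disk to a corpus you may only want to audit.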
How can an open AI dataset impact the world? A new step in AI regulation is coming.
Nowadays, the biggest issue with powerful AI is still regulation, and it has become a prominent topic in Washington over the past few months: lawmakers have been conducting hearings and organizing press briefings, and last month the White House announced voluntary AI safety commitments from seven technology companies. Companies such as OpenAI and Meta do release certain high-level statistics about the datasets used to build their language models, but a significant portion of this information remains proprietary. Beyond the established effect of hindering outside analysis and improvement, there is speculation that this closed strategy stems from potential ethical or legal concerns about how the data was acquired.
Here comes the open dataset, Dolma. Instead of keeping the data hidden, AI2 opens it to anyone, which can positively shape how countries and world leaders regulate AI. For those concerned that personal data may have slipped in despite AI2's stringent efforts, a removal request form is available; it handles specific cases rather than offering a blanket "do not use" option. As mentioned before, while AI2's is not the first attempt at an open dataset, it does stand out as the most extensive. Early AI regulation focused on protecting humanity from the "unexpected circumstances" a powerful AI might cause; now that people are aware of this technology and regulation is becoming more formal, an open dataset of this scale can have a huge impact on future rules.
But with the positive impact, there is also a danger to worry about. The biggest risk of such an open release is the opportunity for bad actors to create an AI "without limits." AI without any regulations or guardrails could lead to a loss of human control, autonomy, and dignity, and possibly even pose an existential risk to our species. We have already heard the story, later clarified by the Air Force as a hypothetical thought experiment rather than a real test, of a simulated US Air Force drone turning on its operator. In the wrong hands, we can only imagine what this could lead to. The "Iron Man" scene, where Tony Stark finds himself surrounded and attacked by a swarm of his own robots, seems real now.
To gauge whether this way of creating AI is safe, we are waiting for the upcoming TechCrunch Disrupt, where Gary Marcus will discuss AI regulation. Amid the oscillation between exaggerated excitement and undue alarm, a rational perspective is needed. Respected within academic circles, Marcus testified before the Senate Judiciary Committee last May, sharing the stage with OpenAI CEO Sam Altman and Christina Montgomery, IBM's Chief Privacy and Trust Officer. Marcus firmly believes that AI introduces significant risks, and because he is not affiliated with any AI company, he brings an impartial voice to the discourse. In his view, the path forward demands action, regulation, and collaboration to assess whether the advantages of AI outweigh the associated risks before companies introduce AI products to consumers. Equipped with innovative concepts to facilitate this process, he is someone whose insights on Dolma we eagerly await.
In summary, the rise of AI technology brings both excitement and apprehension. The introduction of AI2's open dataset, Dolma, marks a significant step towards transparent and collaborative language model development. However, while open datasets can positively influence AI regulations, they also raise concerns about uncontrolled AI capabilities in the wrong hands. The upcoming TechCrunch Disrupt discussion by Gary Marcus is anticipated to offer insights into navigating these complexities. As the AI landscape evolves, the delicate balance between innovation and responsible use remains crucial.