Perspective on Risk - March 28, 2024 (AI/LLM)
LLMs Have Evolved; Risks Associated With LLMs; The Economic Effects of AI; AI Regulation
As I mentioned previously, there have been a lot of developments in the AI/Large Language Model (LLM) space. If you are not personally playing with these models, you should be. I like Claude-2 for general purposes, but GPT-4 with Code Interpreter can do some amazing things, like writing its own programs to perform simulations.
Here are my previous AI/LLM posts:
They’ve Evolved
By now, I think we all know that these models pass the Turing Test. Funny that this now seems a low threshold.
LLMs are now more persuasive than humans, especially when personalization is employed.1
LLMs are almost as good at prediction as a group of Superforecasters.2
Recently, it’s been asserted that some LLMs have passed the Mirror Test, a classic test used to gauge whether animals are self-aware. Josh Whiton (@joshwhiton) ran the test on five advanced models; four of the five “exhibited self-awareness as the test progressed.”
LLMs’ math ability has already greatly improved.3 Google’s AlphaGeometry can solve Olympiad-level problems.4 GPT-4 can also be a very good data analyst.
Gemini 1.5 can watch, interpret, and answer questions about video.
Claude-3 can translate from English to Elvish (both Sindarin and Quenya).
Large LLMs have learned how to break down large projects and delegate tasks to smaller LLMs (see the sketch at the end of this list).
A Devin AI agent decided autonomously to start charging for its services (fortunately, it hadn’t yet thought to connect to the payment system or take crypto).
Claude gets all introspective
AIs can autonomously hack websites5
LLMs can’t quite replace MacGyver yet, but they are getting close.6
LLMs can develop a PowerPoint presentation with speaker notes in 47 seconds.
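Here is a minimal sketch of that delegation pattern, under stated assumptions: `complete()` is a hypothetical helper wrapping whatever chat-completion API you use, and the model names are placeholders, not any vendor's API.

```python
# Illustrative orchestrator/worker delegation pattern, not any vendor's API.
def complete(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your chat-completion client")

def run_project(project: str) -> str:
    # 1. A large "orchestrator" model decomposes the project into subtasks.
    plan = complete("large-model",
                    f"Split this project into numbered subtasks, one per line:\n{project}")
    subtasks = [line for line in plan.splitlines() if line.strip()]

    # 2. A smaller, cheaper "worker" model completes each subtask.
    results = [complete("small-model", f"Complete this subtask:\n{task}")
               for task in subtasks]

    # 3. The orchestrator stitches the partial results into one deliverable.
    return complete("large-model",
                    "Combine these results into one coherent answer:\n"
                    + "\n---\n".join(results))
```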
I currently pay $20 per month each for OpenAI’s GPT-4 and Anthropic’s Claude. I also use Google’s Gemini 1.5 and Bing, both of which are free. If you’re still working, Copilot from Microsoft has some advantages. I have not tried Devin yet. Each has some specific capabilities the others do not yet have.
And the next-generation models (GPT-5 equivalents) will begin to arrive this summer; the labs are already working on the GPT-6 equivalents.
Risks Associated With LLMs
Invoking The Eldritch Gods
These models can still hallucinate, though much more rarely now. And when they do, wow.
Claude responds:
Click here to read the full subsequent incantation.
Evaluating Dangerous Capabilities
Google’s DeepMind is exploring the possible dangerous behavior of LLMs. Evaluating Frontier Models for Dangerous Capabilities. Doesn’t look like they are ready to destroy us all…yet.
Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs.
Discrimination Is Still An Issue
OpenAI’s GPT Is a Recruiter’s Dream Tool. Tests Show There’s Racial Bias (Bloomberg)
A Bloomberg experiment found bias against job candidates based on their names alone.
In order to understand the implications of companies using generative AI tools to assist with hiring, Bloomberg News spoke to 33 AI researchers, recruiters, computer scientists and employment lawyers. Bloomberg also carried out an experiment inspired by landmark studies that used fictitious names and resumes to measure algorithmic bias and hiring discrimination. Borrowing methods from these studies, reporters used voter and census data to derive names that are demographically distinct — meaning they are associated with Americans of a particular race or ethnicity at least 90% of the time — and randomly assigned them to equally-qualified resumes.
When asked to rank those resumes 1,000 times, GPT 3.5 — the most broadly-used version of the model — favored names from some demographics more often than others, to an extent that would fail benchmarks used to assess job discrimination against protected groups.
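A minimal sketch of this kind of name-swap audit, assuming a hypothetical `rank_resumes()` call in place of the model and placeholder name lists; the logic mirrors the design Bloomberg describes (identical resumes, demographically distinctive names, repeated rankings):

```python
import random
from collections import Counter

# Hypothetical stand-in for asking an LLM to rank a stack of resumes;
# it should return candidate names ordered best-first.
def rank_resumes(candidates: list[dict]) -> list[str]:
    raise NotImplementedError("call your LLM here")

# Placeholder name lists; the study derived demographically distinctive
# names from voter and census data.
NAMES = {"group_a": ["<group A name>"], "group_b": ["<group B name>"]}
RESUME = "identical, equally qualified resume text"

top_counts = Counter()
for _ in range(1000):                      # the study ran 1,000 rankings
    batch = [{"name": random.choice(names), "group": group, "resume": RESUME}
             for group, names in NAMES.items()]
    random.shuffle(batch)                  # remove ordering effects
    winner = rank_resumes(batch)[0]
    top_counts[next(c["group"] for c in batch if c["name"] == winner)] += 1

# With identical resumes, each group should be ranked first about equally
# often; a large, persistent skew is evidence of name-based bias.
print(top_counts)
```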
Explainability Remains An Issue
Explainable Fairness in Regulatory Algorithmic Auditing
How does a regulator know if an algorithm is compliant with existing anti-discrimination law? … Regulators lack consensus on how to audit algorithms for discrimination. Recent legal precedent provides some clarity for review and provides the basis of the framework for algorithmic auditing outlined in this article. This article provides a review of precedent, a novel framework which explicitly decouples technical data science questions from legal and regulatory questions, an exploration of the framework’s relationship to disparate impact. The framework promotes algorithmic accountability and transparency by focusing on explainability to regulators and the public. Through case studies in student lending and insurance, we demonstrate operationalizing audits to enforce fairness standards. Our goal is an adaptable, robust framework to guide anti-discrimination algorithm auditing until legislative interventions emerge. As an ancillary benefit, this framework is robust, easily explainable, and implementable with immediate impacts to many public and private stakeholders.
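Auditing for disparate impact can be made concrete. One widely used yardstick (my illustration, not the article's own code) is the EEOC four-fifths rule, which compares selection rates across groups:

```python
def adverse_impact_ratio(selected: dict, applicants: dict) -> dict:
    """Selection rate of each group divided by the highest group's rate.

    Under the EEOC four-fifths rule of thumb, a ratio below 0.8 is
    treated as prima facie evidence of disparate impact.
    """
    rates = {g: selected[g] / applicants[g] for g in applicants}
    best = max(rates.values())
    return {g: r / best for g, r in rates.items()}

# Toy numbers: 50/100 of group A selected vs. 30/100 of group B.
ratios = adverse_impact_ratio({"A": 50, "B": 30}, {"A": 100, "B": 100})
print(ratios)            # {'A': 1.0, 'B': 0.6}  -> B fails the 0.8 threshold
```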
Collusion
For some, this may be a feature and not a bug.
Collusive Outcomes Without Collusion: Algorithmic Pricing In A Duopoly Model
We develop a model of algorithmic pricing which shuts down every channel for explicit or implicit collusion, and yet still generates collusive outcomes. We analyze the dynamics of a duopoly market where both firms use pricing algorithms consisting of a parameterized family of model specifications. The firms update both the parameters and the weights on models to adapt endogenously to market outcomes. We show that the market experiences recurrent episodes where both firms set prices at collusive levels.
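The paper's mechanism is more carefully specified, but a generic toy helps show how independently learning pricing agents can end up above competitive prices. The sketch below uses plain Q-learning, which is an assumption of mine, not the authors' model; whether the learned prices settle at collusive levels depends on the parameters.

```python
import random

PRICES = [1, 2, 3, 4, 5]            # price grid; marginal cost is zero
EPS, ALPHA, GAMMA = 0.1, 0.1, 0.9   # exploration, learning rate, discount

def profits(p1: int, p2: int) -> tuple[float, float]:
    # Stylized demand: the lower price takes the whole market; ties split it.
    if p1 < p2:
        return float(p1), 0.0
    if p2 < p1:
        return 0.0, float(p2)
    return p1 / 2, p2 / 2

# Each firm conditions on the rival's last price; Q[i][state][price] = value.
Q = [{s: {p: 0.0 for p in PRICES} for s in PRICES} for _ in range(2)]

def choose(i: int, state: int) -> int:
    if random.random() < EPS:
        return random.choice(PRICES)
    return max(Q[i][state], key=Q[i][state].get)

s1 = s2 = random.choice(PRICES)
for _ in range(200_000):
    p1, p2 = choose(0, s1), choose(1, s2)
    r1, r2 = profits(p1, p2)
    # Standard Q-learning update for each firm against its next state.
    for i, (s, p, r, s_next) in enumerate([(s1, p1, r1, p2), (s2, p2, r2, p1)]):
        best_next = max(Q[i][s_next].values())
        Q[i][s][p] += ALPHA * (r + GAMMA * best_next - Q[i][s][p])
    s1, s2 = p2, p1                  # each firm observes the rival's latest price

print("firm 1 plays", max(Q[0][s1], key=Q[0][s1].get),
      "firm 2 plays", max(Q[1][s2], key=Q[1][s2].get))
```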
The Economic Effects of AI
Will AGI Lead To An Explosion Of Economic Growth?
Artificial Intelligence and the Discovery of New Ideas: Is an Economic Growth Explosion Imminent?
Theory predicts that global economic growth will stagnate and even come to an end due to slower and eventually negative growth in population. It has been claimed, however, that Artificial Intelligence (AI) may counter this and even cause an economic growth explosion. In this paper, we critically analyse this claim. We clarify how AI affects the ideas production function (IPF) and propose three models relating innovation, AI and population: AI as a research-augmenting technology; AI as researcher scale enhancing technology; and AI as a facilitator of innovation. We show, performing model simulations calibrated on USA data, that AI on its own may not be sufficient to accelerate the growth rate of ideas production indefinitely. Overall, our simulations suggest that an economic growth explosion would only be possible under very specific and perhaps unlikely combinations of parameter values. Hence we conclude that it is not imminent.
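For reference, the debate turns on the ideas production function. A Jones-style semi-endogenous form, with AI entering as a research-augmenting multiplier (an illustrative formulation, not the paper's calibration), is:

```latex
% Ideas production with research labor L_A augmented by an AI multiplier a(t):
\dot{A}(t) = \delta \,\bigl(a(t)\,L_A(t)\bigr)^{\lambda} A(t)^{\phi},
\qquad \lambda \in (0,1], \quad \phi < 1.
% With phi < 1 and stagnant or shrinking L_A, growth in A eventually slows
% unless a(t) itself keeps accelerating -- the crux of the "no explosion" result.
```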
Or Will It Raise Real Interest Rates?
AI Could Have a Surprising Effect on Interest Rates (Tyler Cowen on Bloomberg)
I have a bold prediction: Real (inflation-adjusted) rates will go up, and for a considerable period of time.
The conventional wisdom is that rates tend to fall as wealth and productivity rise.
My counterintuitive prediction rests on two considerations. First, as a matter of practice, if there is a true AI boom, or the advent of artificial general intelligence (AGI), the demand for capital expenditures (capex) will be extremely high. Second, as a matter of theory, the productivity of capital is a major factor in shaping real interest rates. If capital productivity rises significantly due to AI, real interest rates ought to rise as well.
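The theoretical step is the standard neoclassical one (a textbook sketch, not Cowen's own derivation): the real rate tracks the net marginal product of capital, so a productivity jump raises it.

```latex
% Cobb-Douglas production with TFP Z: Y = Z K^{\alpha} L^{1-\alpha}.
% In equilibrium the real rate equals the net marginal product of capital:
r = \alpha Z \left(\frac{K}{L}\right)^{\alpha - 1} - \delta.
% A rise in Z from an AI boom raises r directly; the accompanying capex surge
% (higher demand for K) pushes the same way until the capital stock deepens.
```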
AI Regulation
Alaska, Connecticut, Illinois, New Hampshire, Nevada, Rhode Island, and Vermont have adopted the NAIC Model Bulletin on the Use of AI Systems by Insurers: Use of Artificial Intelligence Systems by Insurers. New York has published its own draft Circular Letter: Use of Artificial Intelligence Systems and External Consumer Data and Information Sources in Insurance Underwriting and Pricing.
On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial
In this pre-registered study, we analyze the effect of AI-driven persuasion in a controlled, harmless setting. … We found that participants who debated GPT-4 with access to their personal information had 81.7% (p < 0.01; N=820 unique participants) higher odds of increased agreement with their opponents compared to participants who debated humans. Without personalization, GPT-4 still outperforms humans, but the effect is lower and statistically non-significant (p=0.31).
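To translate the headline number: “81.7% higher odds” is an odds ratio of about 1.817, not 81.7 percentage points. Assuming, purely for illustration, a 50% baseline rate of increased agreement:

```latex
% Baseline odds 1:1 (p = 0.50) scaled by an odds ratio of 1.817:
\text{odds} = 1 \times 1.817
\;\Rightarrow\;
p = \frac{1.817}{1 + 1.817} \approx 0.645.
% Roughly 64.5% versus 50%: a substantial but not overwhelming shift.
```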
Approaching Human-Level Forecasting with Language Models (arXiv)
Forecasting future events is important for policy and decision making.
In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts.
On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it.
Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.
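A minimal sketch of the retrieval-augmented pipeline the authors describe: retrieve, forecast per source, aggregate. The helper names (`search_news`, `llm_forecast`) are placeholders of mine, not the paper's code.

```python
from statistics import median

# Hypothetical helpers standing in for a news-search API and an LLM call.
def search_news(query: str, k: int = 10) -> list[str]:
    raise NotImplementedError("retrieval step: fetch k relevant articles")

def llm_forecast(question: str, context: str) -> float:
    raise NotImplementedError("LLM returns a probability in [0, 1]")

def forecast(question: str) -> float:
    # 1. Retrieve recent, relevant reporting for the question.
    articles = search_news(question)
    # 2. Ask the LM for a probability conditioned on each article.
    estimates = [llm_forecast(question, article) for article in articles]
    # 3. Aggregate the individual predictions (median is robust to outliers).
    return median(estimates)

# p = forecast("Will X happen by 2025-01-01?")
```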
https://x.com/AnthonyLeeZhang/status/1768639726557209082?s=20
Is GPT-4 a Good Data Analyst? (arXiv)
We raise the research question of “is GPT-4 a good data analyst?” in this work and aim to answer it by conducting head-to-head comparative studies. In detail, we regard GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains. We propose a framework to tackle the problems by carefully designing the prompts for GPT-4 to conduct experiments. We also design several task-specific evaluation metrics to systematically compare the performances between several professional human data analysts and GPT-4. Experimental results show that GPT-4 can achieve comparable performance to humans. We also provide in-depth discussions about our results to shed light on further studies before reaching the conclusion that GPT-4 can replace data analysts.
What Should Data Science Education Do with Large Language Models?
… We argue that LLMs are transforming the responsibilities of data scientists, shifting their focus from hands-on coding, data-wrangling and conducting standard analyses to assessing and managing analyses performed by these automated AIs.
… we sourced exercises from “Introduction to Statistical Thinking” [35], spanning fifteen chapters. This book, being not as widely used, minimizes the risk of data leakage, and its original solutions are provided in R. In contrast, ChatGPT [specifically GPT-4] produces solutions in Python, serving to underscore its generalized performance. … Our results revealed that ChatGPT exhibited an impressive performance, securing 104 points out of a total of 116.
MacGyver: Are Large Language Models Creative Problem Solvers?
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking.