Has Anthropic just wiped out the Software Testing Industry?
Discover how Anthropic's groundbreaking AI technology, **Computer Use**, transforms UI/UX testing. By generating and executing test cases autonomously, it minimizes human intervention and maximizes efficiency. Experience firsthand results as the AI navigates and documents tests in natural language.

In a previous blog post I wrote about an amazing new technology developed by Anthropic called Computer Use. Although the name sounds very boring to me, the technology itself is a game changer and I have no doubt that it will have a huge impact on the entire economy sooner or later (and I am a rather sceptical person).
From my 10+ years of personal experience in IT, there is only one way to find out how much substance a piece of software really brings to the table, regardless of what the vendor promises or tries to sell:
To Try it!
And that's exactly what I've done with more than one use case, and the results have far exceeded my expectations.
About this Use Case
One of the most fundamental parts of the software industry is testing. There are many types of testing:
- Unit testing
- E2E testing
- UI/UX testing
- Penetration testing
- Performance testing
- and so on.
There are already many ways to create all kinds of test scenarios with a few clicks, such as unit tests or even E2E tests with Selenium, Playwright and the like.
But what about UI/UX testing?
Imagine you have developed a new feature that needs to be tested according to the specification.
For example, a shopping cart.
To test the ordering of "product XYZ" on your website, a super simple test would be:
- Navigate manually to the website.
- Click on "Product XYZ" to view its details.
- Add the product to your shopping basket.
- Proceed to checkout.
- Fill in all the required forms (delivery details, payment details).
- Complete the purchase.
- Write observations and take screenshots at each step.
And so on.
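For comparison, here is roughly what the same flow looks like when scripted by hand, for example with Playwright. Everything in this sketch is an assumption for illustration: the shop URL, the selectors and the product name would all have to match your actual application.

```typescript
import { test, expect } from '@playwright/test';

// Hand-scripted version of the manual steps above.
// URL, selectors and product name are illustrative assumptions.
test('order Product XYZ', async ({ page }) => {
  // Navigate to the website
  await page.goto('https://shop.example.com');

  // Open the product detail page
  await page.getByRole('link', { name: 'Product XYZ' }).click();

  // Add the product to the shopping basket and proceed to checkout
  await page.getByRole('button', { name: 'Add to basket' }).click();
  await page.getByRole('link', { name: 'Checkout' }).click();

  // Fill in the required forms (delivery and payment details)
  await page.getByLabel('Full name').fill('Jane Doe');
  await page.getByLabel('Address').fill('1 Example Street');
  await page.getByLabel('Card number').fill('4242 4242 4242 4242');

  // Complete the purchase and capture evidence
  await page.getByRole('button', { name: 'Place order' }).click();
  await expect(page.getByText('Thank you for your order')).toBeVisible();
  await page.screenshot({ path: 'order-confirmation.png' });
});
```

The code itself is trivial; the problem is that every selector and every step has to be discovered, written and maintained by a human.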
And that is just one test case. A typical e-commerce site, for example, would have dozens if not hundreds of use cases, each with branches, different browsers, etc. And every time something changes, we have to update the test cases.
And of course this needs to be documented and a test report generated. And so on.
In the past, many efforts have been made to automate such tasks, but they have always required prior knowledge of the application and a precise description of the steps to be taken by the robot (if I may call it that).
But we are now at a turning point.
With Anthropic's Computer Use, AI agents can fully autonomously define and execute UI test scenarios with zero prior knowledge and almost no human intervention. And, you guessed it, in natural language as simple as:
"Test the purchase of an ebook from search to shopping cart."
And with a few tweaks and the right tools, Claude can:
- Generate the UI/UX test instructions from your codebase
- Execute those instructions
- Document every step and every observation made
- Provide evidence in the form of live screenshots
What makes this technology so special?
Controlling a computer screen with AI is nothing new. Thousands of papers and solutions have tried to solve this problem, and have solved it to some extent, with one caveat: they are either tailored to a specific domain, or they are general-purpose but often fail when something unexpected happens.
This has now changed with two key advances from Anthropic:
- Claude's reasoning abilities are unprecedented.
- Its visual capabilities not only recognise objects in images, but also determine the type of UI elements and their position with very high precision.
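To make the feedback-loop idea concrete, here is a heavily simplified sketch of what an agent loop on top of Anthropic's computer-use beta roughly looks like. This is not Cline's actual implementation: the takeScreenshot and performAction helpers are hypothetical stubs, and the model, tool and beta names are the ones documented at the time of writing, so check the current SDK docs before copying anything.

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Hypothetical helpers: capture the screen as a base64 PNG and execute an action.
async function takeScreenshot(): Promise<string> { return ''; }
async function performAction(input: unknown): Promise<void> {}

// Simplified observe-act loop: the model looks at a screenshot, decides on an
// action (e.g. a left click at a coordinate), we execute it and send back a
// fresh screenshot until the model stops asking for tool calls.
async function runAgent(goal: string) {
  const messages: any[] = [{ role: 'user', content: goal }];

  for (let step = 0; step < 20; step++) {
    const response = await client.beta.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages,
      tools: [{
        type: 'computer_20241022', // beta computer-use tool
        name: 'computer',
        display_width_px: 1280,
        display_height_px: 800,
      }],
      betas: ['computer-use-2024-10-22'],
    });

    const toolUse = response.content.find((block) => block.type === 'tool_use') as any;
    if (!toolUse) break; // no more actions: the model has finished its task

    // e.g. { action: 'left_click', coordinate: [512, 300] } or { action: 'type', text: '...' }
    await performAction(toolUse.input);

    messages.push({ role: 'assistant', content: response.content });
    messages.push({
      role: 'user',
      content: [{
        type: 'tool_result',
        tool_use_id: toolUse.id,
        content: [{
          type: 'image',
          source: { type: 'base64', media_type: 'image/png', data: await takeScreenshot() },
        }],
      }],
    });
  }
}
```

Conceptually, Cline puts a loop like this (plus file and terminal access) behind its chat window in VS Code.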
Sounds promising?
Let's see how it performs in practice:
The application
I developed an application that takes historical events that occurred on today's date, generates AI images for them, and displays them on a page. Pretty cool, eh?

The environment
For this test scenario I only need three things:
- Visual Studio Code
- Access to a Claude model (API key)
- The Cline (previously Claude Dev) plugin. If you don't know what Cline is, please read my previous post about this amazing code assistant.
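If you want to double-check that your API key works before wiring it into Cline, a one-file sanity check with the official @anthropic-ai/sdk package is enough. The model name below is only an example; use whichever Claude model your key has access to.

```typescript
import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

async function checkKey() {
  const msg = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022', // example model name
    max_tokens: 32,
    messages: [{ role: 'user', content: 'Reply with OK if you can read this.' }],
  });
  console.log(msg.content);
}

checkKey().catch((err) => console.error('API key check failed:', err));
```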
Step 1: Generate test cases

In the first step, I asked Claude to generate UI/UX test cases for the application. Note that if you do not have access to the codebase, you can still let Claude explore the web application on his own, given a very specific goal (more on this in my other blog posts about Computer Use).
I asked Claude to limit the number of tests to 3. Here is what he gave me:

Looks good. Let's execute them.
Step 2: Execute the test cases and document the results
Claude executed the generated test cases on its own, interacted with the application, took screenshots and documented the results.
And here in more detail:
The objective was to perform UI testing to ensure that the application performed as expected. The AI agent focused on three key test cases:
- API Key Validation
- Historical Events Display
- Image Generation Error Handling
Test Case 1: API Key Validation
Objective: To verify that the application correctly validates OpenAI API keys.
Steps:
- Launch the application: Accessed http://localhost:5173/.
- Enter an invalid API key: Entered `invalid-key` in the API key input field.

Image description: The input field contains the text "invalid-key".
- Submit the invalid key: Clicked the Submit button.
- See error message: The application displayed an error message.

Image description: An error message stating "Failed to generate image for 1912: Failed to fetch".
- Enter a valid API key: Entered a valid API key formatted as `sk-...`.

Image description: The input field contains a valid API key.
- Submit the valid key: Clicked the Submit button.
Expected results:
- Invalid key should be rejected with an error message.
- Valid key should be accepted and continue to fetch events.
Actual results:
- The application correctly displayed an error for the invalid API key.
- When a valid API key was entered, the application attempted to retrieve data but encountered retrieval problems.
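Just to underline the difference: the same check written as a conventional scripted test might look like the sketch below. The selectors are assumptions about the application's markup; Claude needed none of them, it simply looked at the screen.

```typescript
import { test, expect } from '@playwright/test';

// Hand-scripted equivalent of Test Case 1.
// The placeholder and button texts are assumptions about the app's markup.
test('rejects an invalid OpenAI API key', async ({ page }) => {
  await page.goto('http://localhost:5173/');

  await page.getByPlaceholder('API key').fill('invalid-key');
  await page.getByRole('button', { name: 'Submit' }).click();

  await expect(page.getByText(/Failed to fetch/)).toBeVisible();
});
```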
Test Case 2: Historical Events Display
Objective: Verify that the application correctly retrieves and displays historical events.
Steps:
- Submit a valid API key: After entering the valid API key, I submitted it.
- Wait for data fetching: The application attempted to fetch historical events.
- Observe error messages: Error messages indicating data retrieval errors were displayed.

Image description: An error message stating "Failed to generate image for 1905: Failed to fetch".
Expected results:
- Events should be fetched and displayed for the current date.
- Each event should display its year and description.
- Events should be properly formatted and readable.
Actual results:
- The application displayed error messages instead of historical events.
- No events were displayed on the interface.
Test Case 3: Image Generation Error Handling
Objective: Verify that the application handles image generation errors gracefully.
Steps:
- Observe image generation attempts: After data retrieval, the application attempted to generate images for events.
- Monitor error handling: Errors occurred during image generation.

Image description: Error messages are displayed for image generation errors.
- Check UI responsiveness: Verified that the application remained responsive despite the errors.
... and after a few minutes, here's the final report

Simply stunning.
I tried a few more test cases for other applications and the results were very similar. In my opinion, this new technology is unprecedented.
There are a few important things to bear in mind:
- The speed of execution is nowhere near that of a fully scripted test with Selenium or Playwright. But it is definitely much more flexible, cheaper and more scalable than anything else out there.
- Keep an eye on the cost and length of sessions (see my previous posts on the art of code generation and the other posts on this blog).
- White vs. black box testing: as mentioned above, even if you don't have access to the code, you can let Claude explore the application and find his way around with a clear goal in mind. In this case, make sure you give him all the necessary information in advance. Otherwise, he may stop and ask for input (which is good).
- Avoid any registration dialogs where the model might stumble over a safety violation. You can read more about this in my first post about Claude's Computer Use.
- Scaling tests for large projects: Context length is currently one of the common limitations of all transformer-based LLMs. The longer the context grows, the more you pay, and at some point you will hit a limit where you have to reset the session.
- Make use of lessons learned from "difficult" conversations: if you come across a conversation where Claude is stubborn and reluctant to do what you expect, but eventually manages to solve the problem, let him create a lessons learned document and feed it to him the next time he does something similar.
This is one of the most promising ways to learn from experience without fine-tuning models.
You can read more about this in my previous post "How to let Claude learn from his mistakes" or in my recent post "How to retain knowledge from ChatGPT".
Wrap-up
In this blog post we went through a simple but real example of exploratory testing of a web application using Claude's Computer Use. In this exercise, Claude managed to generate well-structured test cases and execute them entirely by himself, without writing a single line of test code, and with changing UI elements. None of that mattered to him, because he works in a feedback loop: he observes the UI and figures out by himself, with very powerful reasoning, which UI element to use to accomplish his task.
One final thought...
So is this the end of the software tester?
Well, that depends. As with many "threatened" industries (including mine), I believe the only way to deal with the threat is first to understand it, and second to find ways to use it in your day-to-day work. That alone sets you apart and gives you leverage to work on harder things that AI has not yet touched.