MASSIVE Step Allowing AI Agents To Control Computers (MacOS, Windows, Linux)

Unlocking AI Capabilities: How OS World Transforms AI Agents Into Computer Maestros

In today’s rapidly evolving digital landscape, Artificial Intelligence (AI) has been making incredible strides. However, one of the main challenges in the AI domain is effectively benchmarking AI agents to ensure they not only understand tasks but perform them competently across various operating systems like MacOS, Windows, and Linux. This is crucial for refining AI abilities, but until now, consistency and thoroughness in AI testing have been elusive. Enter OS World, a groundbreaking project poised to transform the way AI agents interact with and control digital environments.

Introduction to OS World: A Benchmarking Revolution

OS World is an innovative solution designed by a collaborative team from the University of Hong Kong, CMU, Salesforce Research, and the University of Waterloo. This project isn’t just a theoretical research paper; it’s a fully implemented, open-source toolkit that addresses the benchmarking challenges AI agents face when performing in real computer environments. By releasing both the code and data openly, OS World invites developers worldwide to explore and enhance AI capabilities extensively.

Understanding the Challenge: The Need for Grounding in AI Agents

Grounding is the step where an AI translates instructions into concrete actions, and it is presently fraught with inefficiencies, especially within closed systems like MacOS or Windows. Historically, AI agents have relied on accessibility features and crude workarounds such as overlaying a grid on a screenshot and guessing coordinates – an approach that is imprecise and slow. OS World aims to redefine this process, giving AI agents the tools they need to interpret and execute tasks directly within any computer environment.
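
To make that older approach concrete, here is a rough, hypothetical sketch of screenshot-plus-grid grounding in Python; the use of pyautogui and Pillow, the cell size, and the helper names are illustrative assumptions rather than any particular agent’s implementation.

[code]
# Hypothetical sketch of "grid over a screenshot" grounding (the older approach).
# pyautogui and Pillow are assumed dependencies; the cell size is an arbitrary choice.
import pyautogui
from PIL import ImageDraw

CELL = 100  # grid cell size in pixels

def annotate_screenshot(path="desktop_grid.png"):
    """Capture the desktop, draw a grid the model can reference, and save it."""
    shot = pyautogui.screenshot()          # returns a PIL Image of the current screen
    draw = ImageDraw.Draw(shot)
    width, height = shot.size
    for x in range(0, width, CELL):
        draw.line([(x, 0), (x, height)], fill="red")
    for y in range(0, height, CELL):
        draw.line([(0, y), (width, y)], fill="red")
    shot.save(path)
    return width // CELL, height // CELL   # grid dimensions to mention in the prompt

def click_cell(col, row):
    """Click the center of the grid cell the model picked, e.g. column 3, row 7."""
    pyautogui.click(col * CELL + CELL // 2, row * CELL + CELL // 2)
[/code]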

OS World Mechanics: How It Works

At its core, OS World provides a "sandbox" where AI agents can freely interact with multiple operating systems, execute various applications, and respond to changes within those environments. This project incorporates several operational facets:

  • Environments and Tools: Supports a range of environments and tools, including SQL, Python, and Hugging Face, giving agents the breadth of tooling that real computer work demands.
  • Intelligent Agent Framework: AI agents in OS World perceive their environment via sensors and act upon it through effectors, showing properties like autonomy and reactivity – crucial traits for sophisticated AI operations (a minimal sketch of this perceive-act loop follows the list).
  • Multimodal Interaction: Through OS World, AI agents interact using both graphical user interfaces (GUIs) and command-line interfaces (CLIs), handling tasks involving multiple apps and complex workflows.
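
The perceive-act loop described above can be pictured with a short sketch. This is not OS World’s actual API; the helper callables (observe, ask_model, act) and the step limit are hypothetical stand-ins that only show the shape of the interaction, including the use of the last few observation/action pairs as history.

[code]
# Minimal sketch of a perceive-act loop; observe, ask_model, and act are
# caller-supplied stand-ins, not OS World's actual API.
from typing import Callable

Observation = dict  # e.g. {"screenshot": bytes, "a11y_tree": str}

def run_task(
    instruction: str,
    observe: Callable[[], Observation],                   # sensors: screenshot + accessibility tree
    ask_model: Callable[[str, Observation, list], str],   # returns action code, or DONE / FAIL
    act: Callable[[str], None],                           # effectors: execute the action
    max_steps: int = 15,
) -> bool:
    """Observe, plan one action with the model, execute, and repeat."""
    history: list[tuple[Observation, str]] = []
    for _ in range(max_steps):
        obs = observe()
        action = ask_model(instruction, obs, history[-3:])  # last few steps as context
        if action.strip() in ("DONE", "FAIL"):
            return action.strip() == "DONE"
        act(action)
        history.append((obs, action))
    return False
[/code]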

Real-World Applications and Testing Environment

OS World doesn’t just simulate generic tasks; it draws on real-world applications. For example, managing bookkeeping data from image receipts or updating a computer’s wallpaper – tasks that require interpreting both visual and textual data. The evaluation process in OS World scores task performance against concrete, checkable criteria, such as verifying that specific cookies are absent from the browser settings after the task, ensuring results reflect real-world utility.
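
As an illustration of that kind of execution-based check, here is a minimal sketch of a script that verifies no amazon.com cookies remain in a Chromium-style cookie store after the task; the database path and function name are placeholder assumptions, not OS World’s actual evaluator code.

[code]
# Minimal sketch of an execution-based check: did the agent clear Amazon's cookies?
# The Chromium-style cookie database path used by the caller is a placeholder assumption.
import sqlite3

def amazon_cookies_cleared(cookie_db_path: str) -> bool:
    """Return True if no cookie in the store belongs to an amazon.com host."""
    con = sqlite3.connect(cookie_db_path)
    try:
        (count,) = con.execute(
            "SELECT COUNT(*) FROM cookies WHERE host_key LIKE ?",
            ("%amazon.com",),
        ).fetchone()
    finally:
        con.close()
    return count == 0

# Example scoring: 1.0 for pass, 0.0 for fail
# score = 1.0 if amazon_cookies_cleared("/home/user/.config/chromium/Default/Cookies") else 0.0
[/code]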

Impact and Future Prospects

By providing a robust testing framework, OS World is set to significantly enhance the AI development cycle. It allows developers to monitor, tweak, and retest AI agents in lifelike settings, leading to faster improvements and more reliable AI applications. Future integrations envisioned include more intuitive interactions between AI agents and operating systems, potentially embedded directly within system architectures.

Conclusion: Toward a More Interactive Digital Future

OS World marks a substantial progression in AI-agent benchmarking and operation. It represents a leap towards more intelligent, responsive, and capable AI systems able to perform an array of tasks across various environments with minimal human oversight. For developers and technologists, OS World offers an invaluable resource that promises to catalyze further innovations in AI functionality and integration.

In sum, OS World is not just a tool but a transformative framework set to redefine the boundaries of AI capabilities, making it an essential development in the quest for truly autonomous and versatile AI agents. Whether it’s changing a desktop background or managing complex data entries, OS World equips AI agents with the skills needed to navigate and manipulate digital worlds with unprecedented effectiveness.

[h3]Watch this video for the full details:[/h3]


OS World gives agents the ability to fully control computers, including MacOS, Windows, and Linux. By giving agents a language to describe actions in a computer environment, OS World can benchmark agent performance like never before.
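
To give a concrete sense of what such a benchmark task might contain, here is an illustrative sketch of a task definition with an instruction, an initial-state setup, and an execution-based evaluator; the field names and values are invented for illustration and do not reflect OS World’s actual task schema.

[code]
# Illustrative shape of one benchmark task: instruction, initial-state setup,
# and an execution-based evaluator. Field names are invented, not the real schema.
example_task = {
    "instruction": "Set the desktop wallpaper to the photo in ~/Pictures/beach.jpg",
    "setup": [
        {"type": "download",
         "url": "https://example.com/beach.jpg",     # placeholder URL
         "path": "~/Pictures/beach.jpg"},
    ],
    "evaluator": {
        "check": "wallpaper_path_equals",            # hypothetical check name
        "expected": "~/Pictures/beach.jpg",
    },
}
[/code]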

Try Deepchecks LLM Evaluation For Free: https://bit.ly/3SVtxLJ

Be sure to check out Pinecone for all your Vector DB needs: https://www.pinecone.io/

Join My Newsletter for Regular AI Updates 👇🏼
https://www.matthewberman.com

Need AI Consulting? 📈
https://forwardfuture.ai/

My Links 🔗
👉🏻 Subscribe: https://www.youtube.com/@matthew_berman
👉🏻 Twitter: https://twitter.com/matthewberman
👉🏻 Discord: https://discord.gg/xxysSXBxFW
👉🏻 Patreon: https://patreon.com/MatthewBerman
👉🏻 Instagram: https://www.instagram.com/matthewberman_ai
👉🏻 Threads: https://www.threads.net/@matthewberman_ai

Media/Sponsorship Inquiries ✅
https://bit.ly/44TC45V

[h3]Transcript[/h3]
today we have a really interesting project one of the biggest hurdles for AI agents is actually how to test them how to know if they’re doing things correctly and that’s really the only way for them to improve but today there’s not really a way to do it consistently and thoroughly until now a new project called OS World aims to fix the benchmarking problem for AI agents and it’s not only a research paper they also release the code they release the data as well basically everything is open-source and I really appreciate that so we’re going to talk about it today I’m going to tell you all about it and it’s super interesting so let’s get into it so first here’s the research paper OS World benchmarking multimodal agents for open-ended tasks in real computer environments and it’s out of the University of Hong Kong CMU Salesforce Research and the University of Waterloo and the project actually comes with a presentation which I think did a fantastic job of explaining what is going on so I’ll show you that in a minute but the gist is to date we haven’t had a great way to benchmark AI agents to allow them to perform actions in an environment and actually test and get the results of how well they’re performing and that’s what OS World aims to fix it gives agents a robust environment multiple operating systems a way to interact with the environment and a way to actually measure the performance so first let’s go over the slides because this is such a great presentation I think it sums it up really well so this is by Tao Yu out of the University of Hong Kong just came out a few weeks ago so the first page shows IKEA furniture assembly and it’s trying to set up an analogy for how humans take instructions and actually execute those instructions so on the left we have IKEA assembly instructions and then on the right we have the assembled chair but what happens in between those two things first on the left we have those step-by-step plans but that’s not enough that’s not actually enough just having the step-by-step plans is not enough to go assemble a chair we need grounding we need to actually know how to take those step-by-step instructions and execute them and that execution step which also includes getting feedback and perceiving the world and in this example perceiving the different steps of building the chair is incredibly important for actually executing the task successfully and here they call that the grounding so now let’s look at a digital task we have computer tasks in a digital world task instruction how do I change my Mac desktop background so we have the Mac OS environment we have our control instructions which is basically just from the help on the Apple website and then we have the final outcome of Mac OS with new wallpaper but again how do we get from just the instructions to the final executed task we need grounding and grounding in this case comes in the form of a mouse and keyboard now it’s already difficult just based on that to date I believe Open Interpreter is probably the best at taking instructions and being able to actually control the computer and it’s really difficult to do so because you know Mac is a closed system Windows is a closed system and so basically what they do is they typically take a screenshot of the entire desktop then they put a grid over it and then the large language model tells the mouse and keyboard where to move on that grid but it’s all done through accessibility features and it’s imprecise to say the least so it’s really a very inefficient way of 
controlling a computer and let’s see what’s next so can LLMs and VLMs be used for these tests so the answer is yes and no according to this presentation so let’s ask ChatGPT on the left side how do I change my Mac desktop background and ChatGPT gives us perfect instructions so step-by-step instructions and then for real-world tasks it can’t really help with an IKEA chair right cuz you ask it how do I assemble an IKEA chair and it gives you only the most high level information about how to do that so let’s look at the yes let’s look at how to actually execute digital instructions so again on the left we have the step-by-step instructions on how to change the Mac desktop background and what we need is control instructions how do I take the actual step-by-step instructions and control the computer and that grounding is the missing piece because there is no really cut and dried way to control a Mac desktop for example again it’s usually take a screenshot place a grid over it and try to guess what the coordinates are very imprecise and it says right here ChatGPT cannot execute tasks on your Mac by grounding plans into actions and as the second example the real-world example ChatGPT also cannot generate step-by-step plans without interacting in the environment so basically without getting that feedback and how’s it going to get feedback from the real-world environment without a lot of sensors which we don’t really have right now then it’s not actually able to go give you really solid instructions for how to do it now before we get to the solution this presentation talks about what are actual agents so we have a user over here the user gives an instruction the LLM as an agent is able to code actions it has a bunch of actions it can do and it’s basically able to code it so here we have SQL we have API calls we have web and app control so actually being able to control the desktop and even an embodied AI in the form of a robot so we can actually use large language models to generate code that controls a robot then we have the environment and that is like the Mac OS or Windows but we have more than that we have the data we have websites we have apps we have mobile desktop and we have the physical world all the different environments that we should be able to operate within then we need to be able to gather observations and place those back into the large language model because this is going to be an iterative loop it needs to plan it needs to perform and then it needs to observe and use that information to iterate once again and then of course we have whatever tools we want to use Hugging Face SQL Python etc so here they say what is an intelligent agent and I’ve never actually heard that term before but let’s take a look at what it says so the definition is an intelligent agent perceives its environment via sensors and acts rationally upon that environment with its effectors a discrete agent receives percepts one at a time and maps this percept sequence to a sequence of discrete actions so let’s take a look at this little funny looking chart that they have here we have the agent the agent can gather input through sensors in the form of percepts then it can plan and actually perform actions via its effectors things that can actually affect the environment so the properties of an intelligent agent are it’s autonomous it’s reactive to the environment it’s proactive goal-directed and it interacts with other agents via the environment so this is all really cool and I keep thinking back to CrewAI 
and AutoGen and other AI agent frameworks because this feels very akin to that so let’s look at some examples of what this can be the environment can be a computer mobile data or the physical world if you’re using an embodied AI agent then for the sensors we can use camera screenshot ultrasonic radar and now I’m kind of thinking of Tesla autonomous vehicles acting as agents then we have the agent itself where the LLM or VLM is the brain and the effectors so that’s the robot or the interpreter so here’s an example of a robotic physical world agent so we give the instructions stack the blocks on the empty bowl we have the code right here so block name detect blocks detect objects and basically stack them up this is all the code necessary to do that that is in the actions so here are all the different options for the actions independent of the environment that we’re working within and here’s again the basic workflow so we’ve already talked about this but it also says we have to be able to interpret abstract user instructions utilize tools and expand capacities explore complex unseen environments multi-step planning and reasoning and follow feedback and self debug these are all part of being an agent all stuff that we’ve already seen and so it seems one of their big innovations is this XLang which basically takes natural language instructions and translates that into code that can be executed in an environment and so here’s the XLang website you can see very similar to what we’ve been seeing already and here’s the XLang GitHub page where you can get the OpenAgents project as well as OS World and these are all open source which is awesome thanks to the sponsor of this video Deepchecks Deepchecks helps teams building LLM applications evaluate monitor and debug their LLM-based applications with Deepchecks you can release high-quality LLM apps quickly without compromising on testing imagine building a RAG-based chatbot application you do not want it to hallucinate or have inaccuracies hallucinations incorrect answers bias deviations from policy harmful content and more need to be detected explored and mitigated before and after your app goes live easily compare versions of your prompts and models to pentest your LLM-based app for undesired behavior to enhance their text annotation efforts with automated scoring or to monitor the actual quality of your LLM-based app in production it allows you to create custom properties and rules to evaluate your LLM applications based on your requirements Deepchecks supports RAG chatbots Q&A summarization text-to-SQL text-to-code and other content generation Deepchecks’ LLM evaluation solution is currently available for free trials if you’re building any kind of LLM-based application you should definitely check out Deepchecks I’ll drop the link in the description below so you can get your free trial thanks again to Deepchecks and now back to the video so they’ve actually published a bunch of work and projects recently they have Instructor which is uh adapting to various agent environments by simply providing instructions Binder which is one of the first LLM plus tool use studies Lemur open state-of-the-art LLMs for language agents OpenAgents an open platform for language agents in the wild this is an agents project that I actually haven’t tested yet I had not even heard of it before reading about it here we have Text2Reward which connects large language model agents to the physical world and then OS World which is what we’re talking about today okay so that’s 
enough theory let’s actually talk about how it’s working so here’s an example so I zoomed in and we have computer tasks often involve multiple apps and interfaces so the instruction example that is given here update the bookkeeping sheet with my recent transactions over the past few days in the provided folder I am so blown away that if not today eventually and pretty soon I believe agents are going to be able to take complex task instructions like this and actually go execute them on our behalf that’s why I’m excited about agent frameworks that’s why I’m excited about the Rabbit device hopefully you’re seeing the potential of agents so in this example we have the operating system right here they need to open up office they actually need to open up different images which contain receipts and then they need to read the image look for the different line items the different prices and input those into the spreadsheet this is very complex but how do agents actually do that it is incredibly difficult to do that in the Mac OS environment in Windows environments because there’s no grounding layer there’s no ability to take those instructions and actually generate the instructions to interact with the environment and so that’s where OS World comes in which is the first scalable real computer environment OS World can serve as a unified multimodal agent environment for evaluating open-ended computer tasks that involve arbitrary apps and interfaces across operating systems so within this environment they can operate any of the operating systems they can operate any amount of applications within that and they can even operate the interfaces themselves both the UI and the CLI and it’s able to provide observations to the agents the agents are able to use grounding to actually generate instructions for how to interact with the computer environment so let’s look at what an agent task includes an autonomous agent task can be formalized as a partially observable Markov decision process so we have the state space so the current desktop environment we have the observation space which is the instruction the screenshot the a11y tree which I’ll show you in a minute and then we have the action space what they can actually do so being able to click then we have the transition function and the reward function so when each task is generated we have this initial state that’s located in this task config so here’s the instructions right here we have the config which is the current state we have the evaluator so basically how do we tell if the task is completed or not we have the result compared to the expected and then we have the function with different options so how does it actually get the observations well there’s really a few ways that they describe here we have the set of marks and the accessibility tree the set of marks is kind of like a grid format it basically just tells it how to click on different objects within the screen and this is very akin to how Open Interpreter works today it basically has a grid and it decides where to click now rather than figuring out what the grid is and figuring out where each button is within the grid that is the point of OS World it is actually telling the language model where everything is and then we have the accessibility tree which is basically a code version of that so the AI agent generates an action which results in a new state and then a new observation so here is an example of the actual interaction with the environment we have the mouse moving we have clicking we have 
writing text we have pressing the keyboard we have using hotkeys scrolling dragging key up key down waiting failing done so that is how it actually interacts with the environment so how are the task executions actually evaluated and that’s what we’re seeing here so we have the task instruction as an example we have this initial state can you help me clean my computer by getting rid of all the tracking things that Amazon might have saved so the evaluation script which I guess is just a simplified version an example version of it is it actually checks it grabs the cookies and sees does amazon.com have any cookies left and if not it passed and if so it failed then over here we have rename sheet one to L’s resources then make a copy of it place the copy before sheet two rename it by appending etc etc and again it’s simply checking whether it was done or not so this is a great way to benchmark in a really accurate way so they created 369 real-world computer tasks that involve real web and desktop apps in open domains they use OS file reading and writing they do multi-app workflows through both the GUI and the command line and each example task is carefully annotated with real-world task instructions from real users an initial state setup config to simulate human work in progress and a custom execution-based evaluation script so let’s actually look at the prompt this is what the actual prompt looks like so they tested it against CogAgent which I had not heard of GPT-4 Gemini Pro Claude 3 as agents then we have the prompt details which we’re seeing over here so you’re an agent which follow my instructions and perform desktop computer tasks as instructed you have good knowledge etc etc you are required to use PyAutoGUI to perform the action grounded to the observation return one line or multiple lines of python code to perform each of the actions you need to specify the coordinates by yourself based on the current observation and here’s a password you can use so uh really just a pretty thorough prompt to give to the large language model the temperature of one which I thought was interesting because that means that it’s going to be the most creative basically and a top-p of 0.9 now I would think it would want to keep the temperature really low uh I’m not actually sure why they decided to keep the temperature at one and then they also provide the most recent three observations and actions as history context for each step basically helping the large language model understand what has come before and that will inform what it needs to do going forward and as input settings as we’ve already talked about they have set of marks and accessibility tree and they actually have four different versions of it they have accessibility tree only screenshot only screenshot plus accessibility tree and set of marks now let’s look at the results so on the left we have the different input modes so the a11y tree which is the accessibility tree we have the screenshot we have the accessibility tree plus screenshot and then the set of marks so what we have found is first of all GPT-4 across the board has been the winner the only exception is with screenshot-only mode in which Gemini Pro Vision did the best and it seems that either the accessibility tree or using a screenshot plus the accessibility tree are giving the best result with really the accessibility tree alone being really the winner of the best way to give observation to the large language model the set of marks actually also works 
pretty well the screenshot alone does not and that’s interesting because I still believe Open Interpreter the way that they’re actually interacting with computers is by screenshot now if you’re trying to deploy agents to a real-world environment consumers are not going to have OS World installed on their machines or expose OS World as an environment so what we’re probably going to have to do in the long run is build into operating systems a way for agents to interact with them more effectively and one interesting insight that they had is higher screenshot resolution typically leads to improved performance so if they’re just doing screenshots you can see right here the success rate percentage increases as the resolution of that screenshot improves so that’s it that is the OS World project I really like it I appreciate that it’s going to allow us to benchmark agents testing and actually having results is the only way to improve anything I’m thinking about actually setting up OS World on my own machine testing it out maybe I’ll create a tutorial from it if you want to see that let me know in the comments below if you liked this video please consider giving a like and subscribe and I’ll see you in the next one
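
For reference, here is a short, hypothetical example of the kind of PyAutoGUI action code the prompt described in the transcript asks the model to return at each step; the application, hotkeys, delay, and coordinates are invented for illustration and are not taken from the benchmark.

[code]
# Hypothetical example of PyAutoGUI-style action code a model might return for one
# step; the app, hotkeys, delay, and coordinates below are invented for illustration.
import time
import pyautogui

pyautogui.hotkey("command", "space")             # open Spotlight on macOS
pyautogui.write("System Settings", interval=0.05)
pyautogui.press("enter")
time.sleep(2)                                    # give the window time to appear
pyautogui.click(412, 318)                        # click "Wallpaper" at observed coordinates
[/code]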