Building AI Tools for Accessibility and Creativity
Latest projects using GPT-4, Stable Diffusion Img2Img, AutoGPT, and Whisper
Much of April I’ve been back in DC focusing on building, so there are lots of project updates here. I’m currently exploring what to make next as I continue to build out a portfolio of AI projects. If you have interesting problems to work on, or if you’re a builder in this space and recommend going deeper on certain tools — let’s chat!
TLDR;
I was accepted into an AI hackathon in NY (150 devs with a 500-person waitlist!). My team built Ally, showing how GPT-4 can make the web more accessible.
Doodle & Diffuse — collaborative drawing with AI — launched at the AI Adventureland event in Oakland.
I dove into OpenAI’s Whisper, making a YouTube transcription tool and a microphone voice-to-text web app.
Building things 🚀🚀🚀
Ally — Using AI to make the web more accessible. My team built Ally at the NYC AI Hackathon earlier this month. Given a URL, Ally uses GPT-4 to read the HTML and CSS and serve back a version of the website with accessibility improvements.
Here we're thinking about improving sites for people with visual impairments, but you can also imagine design improvements taking into account neurodiversity. How might the experience improve for readers with ADHD?
The updated CSS includes notes on the improvements, e.g. “Add ARIA role attribute for better screen reader support” and “Increase heading font-size for better readability.”
The challenge we faced was the limited context window. We worked around this by testing on smaller websites and by seeing how far we could get reading only the CSS.
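For a flavor of the CSS-only path, here’s a minimal sketch using the OpenAI Node SDK. The prompt wording and the `improveCss` helper are illustrative, not Ally’s actual code:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper: ask GPT-4 to rewrite a stylesheet for accessibility,
// annotating each change with a comment explaining the improvement.
async function improveCss(css: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content:
          "You are an accessibility expert. Rewrite the user's CSS to improve " +
          "readability and screen-reader support. Add a comment above each " +
          "change explaining the improvement.",
      },
      { role: "user", content: css },
    ],
  });
  // Fall back to the original stylesheet if the model returns nothing
  return res.choices[0].message.content ?? css;
}
```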
Read more about the project in “Can AI help make the web more accessible?”
Doodle & Diffuse launch! I launched Doodle & Diffuse at AI Adventureland, an immersive event showcasing a range of creative AI projects.
Previously, the app ran locally with a temporary Flask server from a Colab notebook. This was good for prototyping, but difficult to share. I rewrote the app to run on a NextJS server, calling the Stable Diffusion img2img endpoint via Replicate.
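If you’re curious what that server-side call looks like, here’s a rough sketch using Replicate’s Node client. The model version hash is a placeholder, and the exact input parameter names (like `prompt_strength`) vary by model:

```ts
import Replicate from "replicate";

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

// Send the doodle (as a data URL) plus a text prompt to a Stable Diffusion
// img2img model hosted on Replicate.
export async function doodleToImage(doodleDataUrl: string, prompt: string) {
  const output = await replicate.run(
    "stability-ai/stable-diffusion-img2img:<version-hash>", // placeholder version
    {
      input: {
        image: doodleDataUrl,
        prompt,
        prompt_strength: 0.6, // how far the model may wander from the doodle
      },
    }
  );
  return output; // typically a list of generated image URLs
}
```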
The output images are way more abstract than the earlier prototype’s. This was surprising, since I’m using a newer model. I’m still tweaking the parameters, and may switch to a different model or scheduler entirely.
I’m not sharing the app publicly at the moment because of model costs, but if you’re interested in testing it out, I’m happy to share a link!
I’m interested in adding additional features to help make the app more welcoming. These include:
Prompt writing magic: A button that will make your prompt more descriptive, whimsical, and just generally lead to a fun image.
Quick background image: It can be easier to draw on a background image than on a blank canvas. Often with this app I turn the prompt strength up and generate an image before doodling. I’d like to make this workflow easier: perhaps a button that turns up the strength and cycles through some interesting backgrounds.
ControlNet integration: ControlNet can take a black-and-white sketch and generate a full-color image that follows its lines. This could be a fun way to get started with the app.
Playing with AI Agents. There’s been a ton of conversation and hype building around AI agents over the last month, especially with the launch of AutoGPT. Read on for my exploration of this in “Getting a feel for AI Agents, the thought-action-observation loop, and the materiality of AI.”
I used AutoGPT to run some market research and problem discovery, first in generative AI, and then in genAI specific to medical applications.
You can use ChatGPT to get a feel for the thought-action-observation loop without programming it yourself. You can also explore other reflective loops. I do this for an AI feelings loop: FEEL, INTENT, SAY, REFLECT, FUTURE, PAUSE.
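If you’d rather drive the loop through the API than the ChatGPT UI, here’s a minimal sketch of its shape. The prompt wording is mine, and the ACTION step is stubbed rather than executed:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Seed a ReAct-style conversation: the model alternates THOUGHT and ACTION,
// and we feed an OBSERVATION back after each action.
const history: { role: "system" | "user" | "assistant"; content: string }[] = [
  {
    role: "system",
    content:
      "Work in THOUGHT / ACTION / OBSERVATION steps. Propose one ACTION at a " +
      "time, then wait for its OBSERVATION before continuing.",
  },
  { role: "user", content: "Research: what problems could genAI solve in medicine?" },
];

for (let step = 0; step < 3; step++) {
  const res = await openai.chat.completions.create({ model: "gpt-4", messages: history });
  const reply = res.choices[0].message.content ?? "";
  console.log(reply);
  history.push({ role: "assistant", content: reply });
  // A real agent would execute the proposed ACTION (search, read a page, ...);
  // here we stub the result just to watch the loop's shape.
  history.push({ role: "user", content: "OBSERVATION: (stubbed result)" });
}
```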
These agents aren’t as reliable as viral threads make them seem; often the examples shared are the exceptions that happened to work. Still, they’re interesting to explore, and they’ll improve in reliability over time.
Building with OpenAI’s Whisper API. I’ve been working with OpenAI’s Whisper for a couple of projects. Whisper is a powerful API that takes voice recordings and transcribes them (or translates them, too!).
YouTube Transcript: After watching an instructional video on YouTube, I wanted to revisit certain parts of it and pull sections into my notes. I used Whisper to convert the video into text via a command line tool that takes in the YouTube URL and outputs the transcript. You can find it here. The output comes as a wall of text; a future feature I’d love would be to add formatting, breaking the text into sections based on the content itself.
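If you want to roll your own, here’s a minimal sketch of the idea, assuming yt-dlp handles the audio download and using the OpenAI Node SDK (not the tool’s exact code):

```ts
import { execFileSync } from "node:child_process";
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

async function transcribe(url: string) {
  // Pull just the audio track with yt-dlp (installed separately); note the
  // Whisper API caps uploads at 25 MB, so long videos need chunking.
  execFileSync("yt-dlp", ["-x", "--audio-format", "mp3", "-o", "audio.mp3", url]);

  const result = await openai.audio.transcriptions.create({
    model: "whisper-1",
    file: fs.createReadStream("audio.mp3"),
  });
  console.log(result.text); // arrives as one long wall of text
}

const url = process.argv[2];
if (!url) throw new Error("usage: yt-transcribe <youtube-url>");
await transcribe(url);
```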
Voice to Text Transcription with OpenAI Whisper and NextJS. A couple of months ago I shared a meditation app using ChatGPT. It’s an interactive chat app that follows the internal dialogue one may have during meditation — specifically around redirecting attention to the body or the breath. I’m interested in making it completely voice-based, so I can put on headphones, close my eyes, and have an interactive guided meditation. Integrating Whisper into this app was more challenging than the command line tool and involved working with streaming audio buffers. Now that it’s working, I’ve uploaded an example for others interested in doing this too. You can find it here.
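For anyone attempting the same, here’s a simplified sketch of the server side, assuming a NextJS pages-router API route. The route name and audio format are illustrative, and the real app has more going on around the audio stream:

```ts
// pages/api/transcribe.ts (hypothetical route name)
import type { NextApiRequest, NextApiResponse } from "next";
import OpenAI, { toFile } from "openai";

// Disable NextJS body parsing so we can read the raw audio bytes ourselves
export const config = { api: { bodyParser: false } };

const openai = new OpenAI();

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  // Collect the streamed audio chunks from the request body
  const chunks: Buffer[] = [];
  for await (const chunk of req) chunks.push(chunk as Buffer);
  const audio = Buffer.concat(chunks);

  // Whisper expects a named file, so wrap the in-memory buffer
  const transcription = await openai.audio.transcriptions.create({
    model: "whisper-1",
    file: await toFile(audio, "recording.webm"),
  });

  res.status(200).json({ text: transcription.text });
}
```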
Conversations on problem discovery
I put together a problem discovery guide for genAI startups in the B2B space as I navigate these questions. The guide covers use cases, features, and a set of questions to explore problems with potential customers. Here are a few of them:
What’s a business process in your company that takes a lot of time or resources to complete?
Do you do writing that is tedious? Repetitive?
Are there documents or data in your workflow that you want to have a quick summary, blurb, key insights about?
If you had an additional assistant or intern, what would you delegate to them?
If you have tasks for them, why haven’t you hired an extra intern or assistant?
If the cost is too high, at what price would you consider it, given those tasks?
Each of these questions seeks to better understand the problems and frustrations that come up at a company and that intersect with genAI features. So far, my conversations have touched on medical applications, climate tech, manufacturing, pharma, and legal tech. As expected, these spaces go deep. To start a company serving one of them, you’d need strong relationships and trust with the companies and leaders in that space. The tech part isn’t the hard part.
What’s next
Having tons of conversations! Learning about different spaces where AI can have impact, and jamming on B2B opportunities.
My meditation app with audio now works on desktop; I’m currently updating it to work on mobile web.
More open LLMs are becoming available. I’d like to try some of these myself, along with web LLMs that leverage WebGPU.
Following from the work with Replicate, I’m interested in looking into additional options for deploying AI apps.
Ideas and inspo
As I look into problem discovery, I’m inspired by the concept of the idea space from “How to Get and Evaluate Startup Ideas” by Jared Friedman (and yes, I used yt-transcribe to get this text):
An idea space is like one level of abstraction out from a particular startup idea. It is a class of closely related startup ideas like software for hospitals or infrastructure monitoring tools or food delivery services. Different idea spaces have wildly different hit rates. Over the last 10 years, if you started a company that did FinTech infrastructure or vertical SaaS for enterprise, the probability that your company became a billion dollar company was astonishingly high. Whereas if you started something in consumer hardware or social networks or ad tech, the success rate was orders of magnitude lower.
Cristobal, co-founder of RunwayML, gave a talk at the NY AI Hackathon last month. The origin of RunwayML was actually at a hackathon five years ago. Here’s what he shared with attendees:
If you're trying to build something right now, try to reason from first principles. Look at the simplest, most basic truth, and build from the ground up. If you don't do that, what you make will be obsolete in a few months.
In thinking about what to build from generative models, I would consider everyone on the team a scientist. The most important thing to do right now is experiment. Instead of building too much on top, put it out, test assumptions and learn. You’ll learn much more putting it out than thinking about it in a research environment.
Thanks for reading! You can follow what I’m up to by subscribing here. If you know anyone who would find this post interesting, I’d really appreciate it if you forwarded it to them! And if you’d like to jam more on any of this, I’d love to chat here or on Twitter.