Cjwbw

5.3K

Remove image backgrounds

Updated 4/28/2024

clip-vit-large-patch14

4.5K

openai/clip-vit-large-patch14 with Transformers

Updated 4/28/2024

Image-to-Text

anything-v3-better-vae

3.4K

High-quality, highly detailed anime-style Stable Diffusion with a better VAE

Updated 4/28/2024

zoedepth

3.4K

ZoeDepth: Combining relative and metric depth

Updated 4/28/2024

anything-v4.0

3.0K

high-quality, highly detailed anime-style Stable Diffusion models

Updated 4/28/2024

real-esrgan

1.4K

Real-ESRGAN: Real-World Blind Super-Resolution

Updated 4/28/2024

dreamshaper

1.2K

Dream Shaper stable diffusion

Updated 4/28/2024

waifu-diffusion

1.1K

Stable Diffusion on Danbooru images

Updated 4/28/2024

cogvlm

533.724

CogVLM is a powerful open-source visual language model, made available by the maintainer cjwbw. It comprises a vision transformer encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module. CogVLM-17B has 10 billion vision parameters and 7 billion language parameters, and it achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, and more. It can also hold conversations about images.

Similar models include segmind-vega, an open-source distilled Stable Diffusion model with 100% speedup; animagine-xl-3.1, an anime-themed text-to-image Stable Diffusion model; cog-a1111-ui, a collection of anime Stable Diffusion models; and videocrafter, a text-to-video and image-to-video generation and editing model.

Model inputs and outputs

CogVLM accepts both text and image inputs. It can generate detailed image descriptions, answer various types of visual questions, and hold multi-turn conversations about images.

Inputs

Image: The input image that CogVLM processes and responds to.

Query: The text prompt or question that CogVLM answers with respect to the input image.

Outputs

Text response: The text that CogVLM generates from the input image and query.

Capabilities

CogVLM can describe images accurately and in detail with very few hallucinations. It can understand and answer various types of visual questions, and a visual grounding version can tie the generated text to specific regions of the input image. CogVLM sometimes captures more detailed content than GPT-4V(ision).

What can I use it for?

CogVLM can be used for applications such as image captioning, visual question answering, and image-based dialogue systems. Developers and researchers can use it to build multimodal AI systems that process both visual and textual information.

Things to try

One interesting aspect of CogVLM is its ability to hold multi-turn conversations about images. Try providing a series of related queries about a single image and observe how the model responds and maintains context throughout the conversation. You can also experiment with different prompting strategies to see how CogVLM performs on visual understanding tasks such as detailed image description, visual reasoning, and visual grounding.
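As a rough illustration of the image-plus-query interface described above, the sketch below pairs one image with a series of queries and records each text response. The field names `image` and `query`, the helpers `build_input` and `ask_series`, and the `run_model` callable are assumptions made for this example, not a documented API; a stub stands in for the hosted model so the code runs offline. Each query is sent on its own, since how conversation history is passed to the model is not described here.

```python
from typing import Callable, Dict, List, Tuple


def build_input(image_url: str, query: str) -> Dict[str, str]:
    """Pair one image with one text query (field names are assumed)."""
    if not image_url:
        raise ValueError("an image is required")
    if not query.strip():
        raise ValueError("a non-empty query is required")
    return {"image": image_url, "query": query.strip()}


def ask_series(
    image_url: str,
    queries: List[str],
    run_model: Callable[[Dict[str, str]], str],
) -> List[Tuple[str, str]]:
    """Send several queries about the same image and collect the answers.

    `run_model` is a placeholder for whatever client actually calls the
    hosted model; it receives the input dict and returns the text response.
    """
    transcript = []
    for query in queries:
        answer = run_model(build_input(image_url, query))
        transcript.append((query.strip(), answer))
    return transcript


def stub_model(payload: Dict[str, str]) -> str:
    """Offline stand-in for the real model, used only for demonstration."""
    return f"(stub answer to: {payload['query']})"


if __name__ == "__main__":
    questions = ["Describe this image.", "What color is the largest object?"]
    for question, answer in ask_series(
        "https://example.com/photo.png", questions, stub_model
    ):
        print(f"Q: {question}\nA: {answer}")
```

Passing the model call in as a function keeps the sketch independent of any particular client library; replacing `stub_model` with a real call is the only change needed to try it against a deployed model.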

Updated 4/28/2024

rudalle-sr

464.132

Real-ESRGAN super-resolution model from ruDALL-E

Updated 4/28/2024