Setting Up the Web Scraper in Unli.ai

Overview

The web scraper in Unli.ai allows you to extract website content and store it as files for easy retrieval. The scraped data is saved in your S3-compatible object storage and automatically processed into your default vector store (currently Pinecone).

Prerequisites

Before setting up the scraper, ensure you have:

  • A configured vector store (e.g., Pinecone).
  • An S3-compatible object storage bucket set up for storing scraped content.
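In practice, these two prerequisites come down to a set of credentials. The variable names below are purely illustrative — Unli.ai's configuration screens may label these fields differently:

```shell
# Hypothetical credential set for an S3-compatible bucket and Pinecone.
# These names are examples, not ones Unli.ai requires verbatim.
S3_ENDPOINT_URL="https://s3.example.com"   # any S3-compatible endpoint
S3_ACCESS_KEY_ID="YOUR_ACCESS_KEY"
S3_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
S3_BUCKET_NAME="scraped-content"
PINECONE_API_KEY="YOUR_PINECONE_KEY"
PINECONE_INDEX_NAME="unli-default"
```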

Steps to Add a Web Scraper

1. Select Scraper as Data Source

  • Navigate to the Datasources section in your Unli.ai project.
  • Click Add new data source and select Scraper from the dropdown.

2. Configure Scraper

  • Assign a name to identify the scraper in your records.
  • Select the Storage (object storage required) where scraped content will be stored.
  • Click Add Scraper to complete the setup.

3. Start Scraping

  • After adding the scraper, enter a URL to scrape content from.
  • The scraped content will be stored in your object storage as TXT files, with object keys that mirror each page's URL path, for example:
    domain.com/page.txt
    domain.com/category/content.txt
    and so on.
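The object keys simply mirror each page's URL, with a .txt extension appended. A minimal sketch of that mapping (this helper is illustrative, not part of Unli.ai's API; how root pages with no path are named is not specified above, so the fallback here is a guess):

```python
from urllib.parse import urlparse

def url_to_object_key(url: str) -> str:
    """Map a page URL to the TXT object key used in storage.

    Illustrative only -- mirrors the domain.com/path.txt convention
    described above, not an official Unli.ai function.
    """
    parsed = urlparse(url)
    # Strip the leading slash; "index" for bare domains is an assumption.
    path = parsed.path.strip("/") or "index"
    return f"{parsed.netloc}/{path}.txt"

print(url_to_object_key("https://domain.com/category/content"))
# → domain.com/category/content.txt
```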

4. Automatic Processing into Vector Store

Once the scraped pages are stored:

  • They will be automatically processed and added to your default vector store (Pinecone).
  • This makes the content indexed and searchable within your Unli.ai project.
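Unli.ai performs this step for you, so no code is required. As a rough sketch of what "processing into a vector store" typically involves, each stored TXT file is split into chunks, embedded, and upserted into the index. The helper below is illustrative only — the chunk size, overlap, and upsert logic are assumptions, not Unli.ai internals:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split scraped text into overlapping chunks for embedding.

    Parameters are illustrative -- Unli.ai's actual chunking
    strategy is not documented here.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Each chunk would then be embedded and upserted into Pinecone,
# roughly (pseudocode; needs the pinecone client and an embedding model):
#   index.upsert([(f"{key}#{i}", embed(c), {"source": key})
#                 for i, c in enumerate(chunks)])

print(len(chunk_text("a" * 1200)))  # → 3
```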

5. Scraping Service Provider

Scraping services are powered by Tinq.ai, ensuring efficient and accurate data extraction.

Once set up, you can easily scrape website content, have it stored, and automatically indexed for AI-powered knowledge retrieval in Unli.ai!