💡🌐 Essential Multipage Website Scraper with Jina.ai

Target Audience

This workflow is ideal for:
- Web Developers looking to automate the process of scraping multiple web pages.
- Data Analysts who need to gather data from various websites for analysis.
- Content Creators who want to collect information and resources from different sources efficiently.
- SEO Specialists aiming to track and analyze competitor websites.
- Researchers needing to gather and organize web content for their projects.

Problem Solved

This workflow addresses the challenge of collecting data from multiple web pages by hand, which is time-consuming and error-prone. By automating the process, it allows users to:
- Efficiently gather data from various URLs without needing an API key.
- Filter and limit the data collected based on specific topics or pages.
- Save the extracted content directly to Google Drive for easy access and organization.

Workflow Steps

1. Manual Trigger: The workflow starts when the user clicks the ‘Test workflow’ button.
2. Set Website URL: The workflow assigns the sitemap URL from which to scrape data.
3. Get List of Website URLs: It retrieves a list of URLs from the specified sitemap.
4. Convert to JSON: The sitemap's XML response is converted to JSON for easier processing.
5. Create List of Website URLs: The workflow creates a list of individual website URLs from the JSON data.
6. Filter By Topics or Pages: It filters the URLs based on specified conditions to focus on relevant content.
7. Limit: The workflow caps the number of URLs passed on to the scraping loop, which keeps each run small and predictable.
8. Loop Over Items: For each filtered URL, the workflow proceeds to scrape the content.
9. Jina.ai Web Scraper: It sends a request to Jina.ai to scrape the webpage content.
10. Extract Title & Markdown Content: The workflow extracts the title and markdown content from the scraped data.
11. Save Webpage Contents to Google Drive: Finally, it saves the extracted content to Google Drive, organizing it for future access.
12. Wait: The workflow pauses briefly before the loop moves on to the next URL, which spaces out the scraping requests.
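
For readers who want to see the logic outside the workflow editor, the steps above can be sketched roughly in Python. This is a minimal sketch under stated assumptions, not an export of the workflow itself: it assumes the sitemap follows the standard sitemaps.org schema, that the scraper is called by prefixing the page URL with `https://r.jina.ai/`, and that the plain-text response contains `Title:` and `Markdown Content:` markers. All function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Assumption: pages are scraped by prefixing the target URL with this address.
JINA_READER_PREFIX = "https://r.jina.ai/"


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Steps 3-5: turn sitemap XML into a flat list of page URLs."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def filter_and_limit(urls: list[str], keywords: list[str], limit: int) -> list[str]:
    """Steps 6-7: keep URLs containing any keyword, then cap the count."""
    matches = [u for u in urls if any(k.lower() in u.lower() for k in keywords)]
    return matches[:limit]


def split_title_and_markdown(reader_text: str) -> tuple[str, str]:
    """Step 10: pull the title and markdown body out of the scraper response.

    Assumes a 'Title: ...' line and a 'Markdown Content:' marker; falls back
    to the whole text when the markers are missing.
    """
    title_match = re.search(r"^Title:[ \t]*(.+)$", reader_text, flags=re.MULTILINE)
    body_match = re.search(r"Markdown Content:\s*(.*)", reader_text, flags=re.DOTALL)
    title = title_match.group(1).strip() if title_match else "untitled"
    markdown = body_match.group(1).strip() if body_match else reader_text.strip()
    return title, markdown


def scrape(url: str, timeout: float = 30.0) -> tuple[str, str]:
    """Step 9: request one page through the scraper and parse the result."""
    request = Request(JINA_READER_PREFIX + url, headers={"User-Agent": "sitemap-scraper-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return split_title_and_markdown(response.read().decode("utf-8"))
```

A caller would fetch the sitemap, pass its text through `urls_from_sitemap` and `filter_and_limit`, then call `scrape` for each remaining URL with a short `time.sleep` between requests to mirror the Wait step. Saving to Google Drive is left out because it depends on credentials that the workflow node handles.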

Customization Guide

Users can customize this workflow by:
- Changing the Sitemap URL: Modify the value in the ‘Set Website URL’ node to scrape a different website.
- Adjusting Filters: Update the conditions in the ‘Filter By Topics or Pages’ node to focus on different topics or pages of interest.
- Modifying Data Saving Options: Change the parameters in the ‘Save Webpage Contents to Google Drive’ node to save files in different folders or with different naming conventions.
- Altering the Limit: Adjust the maximum number of items processed in the ‘Limit’ node to suit the workload.
- Adding More Nodes: Expand the workflow with extra nodes for further processing or analysis of the scraped data.
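
To make the customization points above concrete, the sketch below gathers them into one settings object and shows one possible file naming convention. Everything here is hypothetical and for illustration only: the `ScraperSettings` fields, their default values, and the `safe_filename` helper are assumptions, not names or values taken from the workflow's nodes.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ScraperSettings:
    """Hypothetical settings mirroring the customizable nodes."""
    sitemap_url: str                                           # 'Set Website URL'
    filter_keywords: list[str] = field(default_factory=list)   # 'Filter By Topics or Pages'
    limit: int = 20                                            # 'Limit' (default is an assumption)
    drive_folder: str = "Scraped Pages"                        # target folder (assumed name)
    filename_template: str = "{title}.md"                      # naming convention


def safe_filename(title: str, template: str = "{title}.md") -> str:
    """Build a file name from a page title.

    Drops characters other than letters, digits, underscores, hyphens and
    spaces, then replaces runs of whitespace with a single hyphen.
    """
    cleaned = re.sub(r"[^\w\- ]+", "", title).strip()
    cleaned = re.sub(r"\s+", "-", cleaned) or "untitled"
    return template.format(title=cleaned)
```

For example, a run that should only save blog posts could use `ScraperSettings(sitemap_url="https://example.com/sitemap.xml", filter_keywords=["blog"], limit=5)`, where the sitemap address is a placeholder.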