Structured Data Extract, Data Mining with Bright Data & Google Gemini

Target Audience

This workflow is designed for:
- Data Analysts looking to extract and analyze structured data from web sources.
- Developers seeking to automate data extraction and processing tasks using modern AI tools.
- Businesses in need of insights from web data to drive decision-making and strategy.
- Researchers who require efficient methods for gathering and analyzing data from various online platforms.

Problem Solved

This workflow addresses the challenge of structured data extraction from web pages, enabling users to:
- Automatically gather data from specified URLs without manual intervention.
- Utilize advanced AI models like Google Gemini to analyze and extract meaningful insights from the data.
- Generate structured outputs such as topics and trends that can inform business strategies or research findings.

Workflow Steps

1. Trigger the Workflow: The process starts when the user clicks the ‘Test workflow’ button.
2. Set URL and Zone: The workflow sets the target URL (e.g., https://www.bbc.com/news/world) and the Bright Data zone for web unlocking.
3. Perform Bright Data Web Request: The workflow sends a request to Bright Data to unlock and retrieve the content from the specified URL in Markdown format.
4. Markdown to Textual Data Extraction: The retrieved Markdown is converted into plain text using the Markdown to Textual Data Extractor node.
5. Topic Extraction: The workflow analyzes the textual data to identify key topics using the Topic Extractor node, generating structured information about each topic.
6. Sentiment Analysis: The extracted topics are analyzed for sentiment using the ‘Google Gemini Chat Model for Sentiment Analyzer’ node.
7. Trends Analysis: The workflow identifies emerging trends by location and category from the data.
8. Webhook Notifications: Throughout the process, webhook notifications are sent to specified URLs with the results of data extraction and analysis.
9. File Writing: Finally, the structured data is saved to disk in JSON format, allowing for easy access and further analysis.
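
For readers who want to see the data flow outside the workflow editor, steps 2 to 4 and step 9 can be roughly sketched in Python. This is a minimal sketch under stated assumptions, not the workflow's actual implementation: the endpoint and the request field names (zone, url, format, data_format) are assumptions about the Bright Data request API and should be checked against your account's documentation; the function names are hypothetical; and the regex-based Markdown stripping is a simple stand-in for the Markdown to Textual Data Extractor node. No network call is made here.

```python
import json
import re
from pathlib import Path

# Assumed endpoint for Bright Data web unlocking; verify before use.
BRIGHT_DATA_ENDPOINT = "https://api.brightdata.com/request"


def build_unlock_request(url: str, zone: str) -> dict:
    """Steps 2-3: build the request body asking for the page as Markdown.

    The field names are assumptions, not taken from the workflow itself.
    """
    return {"zone": zone, "url": url, "format": "raw", "data_format": "markdown"}


def markdown_to_text(markdown: str) -> str:
    """Step 4 stand-in: strip common Markdown syntax with regular expressions."""
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", markdown)  # images -> alt text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)  # links -> link text
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]*", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"[*`]+", "", text)  # asterisk emphasis and backtick markers
    return text.strip()


def save_results(results: dict, path: Path) -> None:
    """Step 9: write the structured output to disk as JSON."""
    path.write_text(json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8")
```

The request body would be sent as JSON to the endpoint with an API token in the Authorization header. The topic, sentiment and trend steps are omitted because they depend on the model prompts and schemas configured in the workflow.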

Customization Guide

Users can customize this workflow by:
- Modifying the URL: Change the URL in the ‘Set URL and Bright Data Zone’ node to target different websites.
- Adjusting Parameters: Update the parameters in the Perform Bright Data Web Request node to change how data is fetched (e.g., data format).
- Changing AI Models: Select different models in the Google Gemini Chat Model nodes to vary the analysis.
- Altering Output Schemas: Customize the output schemas in the Topic Extractor and Trends Analysis nodes to fit specific data requirements.
- Updating Webhook URLs: Change the webhook URLs in the Initiate a Webhook Notification nodes to send results to different endpoints.
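
As an illustration of altering an output schema, the sketch below defines a JSON Schema for extracted topics and a small check for missing required fields. The field names (topic, summary, confidence) and the function names are hypothetical and are not taken from the workflow; replace them with whatever your Topic Extractor node is configured to return.

```python
import json

# Hypothetical schema: field names are illustrative, not the workflow's own.
TOPIC_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "topic": {"type": "string"},
            "summary": {"type": "string"},
            "confidence": {"type": "number"},
        },
        "required": ["topic", "summary"],
    },
}


def schema_as_json(schema: dict = TOPIC_SCHEMA) -> str:
    """Serialize the schema so it can be pasted into a node's schema field."""
    return json.dumps(schema, indent=2)


def missing_fields(items: list, schema: dict = TOPIC_SCHEMA) -> list:
    """Return (index, field) pairs for required fields absent from each item."""
    required = schema["items"]["required"]
    return [
        (index, field)
        for index, item in enumerate(items)
        for field in required
        if field not in item
    ]
```

Running missing_fields over the saved JSON output is a quick way to confirm that a customized schema is being followed before the results are passed to downstream systems.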