Structured Bulk Data Extract with Bright Data Web Scraper

Structured Bulk Data Extract with Bright Data Web Scraper automates the extraction of web data, enabling efficient collection and analysis for data analysts, scientists, and developers. This workflow integrates multiple nodes to check snapshot statuses, download data, and aggregate responses, ensuring timely and accurate data retrieval. It significantly streamlines the process of web scraping, saving time and reducing manual effort while providing valuable insights for AI and big data applications.

7/8/2025
16 nodes
Complex
Categories:
Complex Workflow, Manual Triggered
Integrations:
Wait, Aggregate, Sticky Note, ReadWriteFile

Target Audience

This workflow is designed for:
- Data Analysts: Individuals who need to extract and analyze web data efficiently.
- Data Scientists: Professionals seeking to gather data for machine learning and statistical analysis.
- Engineers and Developers: Those looking to integrate web scraping capabilities into their applications or projects.
- Business Intelligence Professionals: Users who require structured data for reporting and decision-making processes.

Problem Solved

This workflow addresses the challenge of extracting structured bulk data from web sources using the Bright Data Web Scraper. It automates the entire process from initiating a scraping request to downloading and saving the data, ensuring that users can efficiently gather the required information without manual intervention.

Workflow Steps

1. Manual Trigger: The workflow starts when the user clicks ‘Test workflow’.
2. Set Dataset ID and Request URL: It assigns the specific dataset ID and request URL for the scraping task.
3. HTTP Request to Trigger Scraping: A POST request is sent to initiate the scraping process.
4. Set Snapshot ID: The workflow captures the snapshot ID from the response for further tracking.
5. Wait for Snapshot Completion: It pauses for 30 seconds to allow the scraping process to complete.
6. Check Snapshot Status: A request is made to check the status of the snapshot, ensuring it is ready for download.
7. Error Checking: If there are no errors, it proceeds to download the snapshot data.
8. Download Snapshot: The snapshot data is downloaded in JSON format.
9. Aggregate JSON Response: The downloaded data is aggregated for easier handling.
10. Webhook Notification: A notification is sent to a specified webhook URL with the response data.
11. Create Binary Data: The aggregated data is converted into a binary format for storage.
12. Write to Disk: Finally, the binary data is written to disk as a JSON file.
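The trigger/poll/download cycle in the steps above can be sketched outside of n8n as a few small helpers. This is a minimal sketch assuming the Bright Data Datasets v3 REST endpoints (`/trigger`, `/progress`, `/snapshot`); `BEARER_TOKEN`, the dataset ID, and the target URL are placeholders you must replace, and the helper names are illustrative, not part of any SDK.

```python
"""Sketch of the snapshot lifecycle automated by this workflow,
assuming the Bright Data Datasets v3 endpoints."""
import json
import urllib.request

API = "https://api.brightdata.com/datasets/v3"

def trigger_request(dataset_id: str, page_url: str) -> urllib.request.Request:
    # Step 3: POST the page(s) to scrape; the response carries a snapshot_id.
    body = json.dumps([{"url": page_url}]).encode()
    return urllib.request.Request(
        f"{API}/trigger?dataset_id={dataset_id}&format=json",
        data=body,
        headers={"Authorization": "Bearer BEARER_TOKEN",
                 "Content-Type": "application/json"},
        method="POST",
    )

def snapshot_ready(progress: dict) -> bool:
    # Steps 6-7: download only once the status is "ready" and the
    # progress payload reports no errors (field names assumed here).
    return progress.get("status") == "ready" and not progress.get("errors")

def download_url(snapshot_id: str) -> str:
    # Step 8: GET this URL to retrieve the finished snapshot as JSON.
    return f"{API}/snapshot/{snapshot_id}?format=json"
```

In the workflow, the 30-second ‘Wait’ node sits between triggering and the `snapshot_ready` check, and the loop repeats until the snapshot is downloadable.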
Customization Guide

To customize this workflow:
- Change Dataset ID: Update the dataset_id in the ‘Set Dataset Id, Request URL’ node to target a different dataset.
- Modify Request URL: Alter the request URL to scrape data from a different web page.
- Adjust Wait Time: Modify the amount in the ‘Wait’ node if a longer or shorter wait is needed for the scraping process to complete.
- Webhook Notification: Change the webhook URL in the ‘Initiate a Webhook Notification’ node to send notifications to a different endpoint.
- File Path: Update the fileName in the ‘Write the file to disk’ node to save the output file in a different location.
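The five customization points above boil down to a handful of node parameters. A hypothetical config dict (every value below is an illustrative placeholder, not taken from the workflow) keeps them in one place and makes it easy to sanity-check a variant before running it:

```python
"""Illustrative mapping of this workflow's customization points;
node names in comments match the workflow, values are placeholders."""
WORKFLOW_CONFIG = {
    "dataset_id": "YOUR_DATASET_ID",        # 'Set Dataset Id, Request URL' node
    "request_url": "https://example.com",    # page to scrape
    "wait_seconds": 30,                      # 'Wait' node amount
    "webhook_url": "https://your-endpoint.example/notify",
    "file_name": "output/snapshot.json",     # 'Write the file to disk' node
}

def validate_config(cfg: dict) -> list[str]:
    """Return the names of required settings that are missing or empty."""
    required = ("dataset_id", "request_url", "wait_seconds",
                "webhook_url", "file_name")
    return [key for key in required if not cfg.get(key)]
```

Checking the dict up front mirrors what the workflow itself cannot do: an empty dataset ID or file name would only surface as a failed node at run time.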
