News Extraction

The News Extraction workflow automates the collection of the latest news articles from a website, summarizes each post, and extracts key technical keywords. It runs weekly and stores the results in a NocoDB database, keeping the news content organized and easy to access.

7/4/2025
36 nodes
Complex
Tags: schedule, complex, openai, sticky note, schedule trigger, nocodb, itemlists, automation, advanced, api, integration, cron
Categories:
Schedule Triggered, Complex Workflow
Integrations:
OpenAi, Sticky Note, Schedule Trigger, NocoDb, ItemLists

Target Audience

This workflow is ideal for:
- Content Creators: Those who regularly produce articles or summaries based on news updates.
- Marketing Professionals: Individuals looking to stay updated on industry trends and news for better content marketing strategies.
- Data Analysts: Analysts who need to extract and summarize information from various news sources efficiently.
- Developers: Those interested in automating data extraction and processing tasks using APIs and web scraping techniques.

Problem Solved

This workflow addresses the challenge of automatically scraping news articles from a website that does not provide an RSS feed. It simplifies the process of gathering, summarizing, and extracting key information such as keywords and publication dates, allowing users to stay informed without manual effort.
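
As a rough illustration of what scraping a listing page without an RSS feed involves, the sketch below pulls link and date pairs out of HTML using Python's standard-library parser. The workflow itself uses CSS selectors inside its extraction nodes, so this is a swapped-in technique, not the workflow's method. The `<article>`, `<a>`, and `<time>` markup, the `example.com` URLs, and the names `ListingParser` and `extract_posts` are all assumptions made for the example and are not taken from the real site.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect link/date pairs from a hypothetical news listing page.

    Assumed markup (illustrative only, not the real site's structure):
      <article><a href="...">Title</a><time datetime="YYYY-MM-DD">...</time></article>
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # record for the <article> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self._current = {"link": None, "date": None}
        elif self._current is not None:
            # Take the first link and the <time> value inside each article.
            if tag == "a" and self._current["link"] is None:
                self._current["link"] = attrs.get("href")
            elif tag == "time":
                self._current["date"] = attrs.get("datetime")

    def handle_endtag(self, tag):
        if tag == "article" and self._current is not None:
            # Keep only complete records; skip articles missing a link or date.
            if self._current["link"] and self._current["date"]:
                self.items.append(self._current)
            self._current = None


def extract_posts(html):
    """Return a list of {"link": ..., "date": ...} dicts found in `html`."""
    parser = ListingParser()
    parser.feed(html)
    return parser.items


# Hypothetical sample input, standing in for the fetched page.
SAMPLE_HTML = """
<article><a href="https://example.com/news/a">A</a><time datetime="2025-07-01">1 July 2025</time></article>
<article><a href="https://example.com/news/b">B</a><time datetime="2025-06-10">10 June 2025</time></article>
"""

if __name__ == "__main__":
    for post in extract_posts(SAMPLE_HTML):
        print(post["date"], post["link"])
```

If the site's markup changes, only the tag checks in `handle_starttag` would need to change, which mirrors updating the CSS selectors in the workflow.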

Workflow Steps

1. Schedule Trigger: The workflow runs weekly, at 4:32 AM on Wednesdays.
2. Retrieve Web Page: It fetches the HTML content of the news page at https://www.colt.net/resources/type/news/.
3. Extract Links and Dates: It extracts the article links and publication dates from the HTML using CSS selectors.
4. Filter Recent Posts: It keeps only the articles published within the last 7 days.
5. Extract Individual Posts: For each article link, it makes an HTTP request to retrieve the full content of the post.
6. Summarize Content: It summarizes the extracted content using the OpenAI API, producing summaries of fewer than 70 words.
7. Identify Keywords: It identifies the three most important technical keywords in each article using the OpenAI API.
8. Merge Data: It merges the summaries, keywords, publication dates, and links into a single dataset.
9. Store in Database: Finally, it stores the structured data in a NocoDB database for further processing or retrieval.
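
The filtering and merging steps above can be sketched in plain Python. This is a minimal illustration, not the workflow's actual node logic: the `Post` type, the function names, and the sample data are hypothetical, and treating the 7-day boundary as inclusive is an assumption, since the description does not say how the cutoff day is handled.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Post:
    link: str
    published: date


def filter_recent(posts, today, days=7):
    """Keep posts published in the last `days` days.

    Assumption: the cutoff day itself counts as recent (inclusive bound).
    """
    cutoff = today - timedelta(days=days)
    return [p for p in posts if cutoff <= p.published <= today]


def merge_records(posts, summaries, keywords):
    """Combine per-article fields by position into one row per article."""
    if not (len(posts) == len(summaries) == len(keywords)):
        raise ValueError("posts, summaries, and keywords must have equal length")
    return [
        {
            "link": post.link,
            "published": post.published.isoformat(),
            "summary": summary,
            "keywords": kws,
        }
        for post, summary, kws in zip(posts, summaries, keywords)
    ]


if __name__ == "__main__":
    # Hypothetical data standing in for scraped posts and OpenAI output.
    posts = [
        Post("https://example.com/news/a", date(2025, 7, 1)),
        Post("https://example.com/news/b", date(2025, 6, 10)),
    ]
    recent = filter_recent(posts, today=date(2025, 7, 2))
    rows = merge_records(recent, ["Short summary."], [["5G", "SD-WAN", "edge"]])
    print(rows)
```

Merging by position only works if the summaries and keywords come back in the same order as the posts, which is why the sketch checks the lengths before zipping.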

Customization Guide

To customize this workflow:
- Change the Schedule: Adjust the schedule trigger settings to fit your preferred timing.
- Modify CSS Selectors: If the website structure changes, update the CSS selectors in the extraction nodes so the links and dates are still retrieved correctly.
- Adapt Summarization Parameters: Alter the summary length or the prompt in the OpenAI nodes to fit your content needs.
- Change Database Configuration: Update the NocoDB node parameters to point to a different database or table as required.
- Add Processing Steps: Include more nodes for further data processing, such as sending notifications or integrating with other applications.
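
One way to picture these customization points is as a single settings object. The sketch below is a hypothetical Python mirror of the values described in this document (schedule, source URL, 7-day window, 70-word limit, three keywords). In the workflow these settings live in the individual n8n nodes, and the class name, field names, and prompt wording here are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorkflowSettings:
    # Standard 5-field cron: minute hour day-of-month month day-of-week.
    # "32 4 * * 3" is 04:32 every Wednesday, matching the described schedule.
    cron: str = "32 4 * * 3"
    source_url: str = "https://www.colt.net/resources/type/news/"
    lookback_days: int = 7
    max_summary_words: int = 70
    keyword_count: int = 3


def build_summary_prompt(settings, article_text):
    """Build a summarization prompt (wording is hypothetical)."""
    return (
        "Summarize the following article in fewer than "
        f"{settings.max_summary_words} words:\n\n{article_text}"
    )


def build_keyword_prompt(settings, article_text):
    """Build a keyword-extraction prompt (wording is hypothetical)."""
    return (
        f"List the {settings.keyword_count} most important technical "
        f"keywords in the following article:\n\n{article_text}"
    )


if __name__ == "__main__":
    defaults = WorkflowSettings()
    # Example customization: shorter summaries, run Mondays at 06:00.
    custom = replace(defaults, max_summary_words=40, cron="0 6 * * 1")
    print(build_summary_prompt(custom, "Article text goes here."))
```

Keeping the tunable values in one place makes it easier to see which change affects which step: `cron` maps to the schedule trigger, `lookback_days` to the filter, and the two prompt builders to the OpenAI nodes.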