Crawler

Table of Contents

Ethical Considerations of Web Crawling in Project nGene.org®


Comparative Analysis of Three Web Crawler Prototypes

(A) JavaScript-Based Web Crawler

(B) Python-Based Web Crawler Utilizing Requests and BeautifulSoup

(C) Selenium-Based Web Crawler Utilizing Browser Automation


Overview of the JavaScript-Based Web Crawler and Image Downloader Prototype


Source Code

Crawler JavaScript Source Code with Detailed Comments


Firefox Crawler Extension

Developing a Firefox Extension for Automatic Media Downloading

Automated Downloading in Firefox Extensions: Minimizing Detection and Ensuring Ethical Use (Written November 30, 2024)

nGeneAutomaticDownloader: Firefox Extension Documentation (Written November 30, 2024 V1.5)

nGeneAutomaticDownloader Extension v1.6 (Written November 30, 2024)


Ethical Considerations of Web Crawling in Project nGene.org®

Project nGene.org is an advanced academic software designed to facilitate programming and research in the field of hemodynamics, integrating computational modeling, simulation, medical statistics, and machine learning. As part of its multifaceted approach, Project nGene.org employs web crawling (web scraping) to aggregate and analyze vast amounts of biomedical data from various online sources. This section delineates the ethical framework guiding the web crawling activities within Project nGene.org, ensuring that data collection practices align with legal standards and the project's academic integrity.



(A) Purpose of Web Crawling in Project nGene.org

Legitimate and Essential Uses:

Avoiding Potentially Unethical Uses:



(B) Respecting Website Policies

Adherence to Ethical Guidelines and Legal Frameworks:



(C) Data Privacy and Consent

Integrating AI Ethics and Data Protection Principles:

a. Ethical Foundations Inspired by AI Ethics:
b. Compliance with Data Protection Laws:
c. Advanced Privacy-Preserving Techniques:
d. Ethical Data Handling Practices:


(D) Impact on Website Performance

Ensuring Responsible Resource Utilization and Server Security:

a. Responsible Crawling to Minimize Server Load:
b. Robust Server Security Measures:
c. Sustainable and Ethical Resource Management:


(E) Intellectual Property and Copyright

Navigating Intellectual Property in the Software Era:

a. Understanding the Dual Nature of Software:
b. Copyright Limitations and Fair Use Considerations:
c. Paracopyright and Digital Rights Management (DRM):
d. Idea/Expression Distinction and Application in Software:
e. Promoting Open Innovation and Collaboration:


(F) Transparency and Accountability

Ensuring Openness and Responsible Stewardship:



(G) Compliance with Legal Frameworks

Adhering to Comprehensive Legal Standards and Regulations:

Data Protection Laws:
Anti-Circumvention Laws:
International Compliance:
Intellectual Property Laws:
Ethical Standards and Best Practices:


Best Practices Implemented by Project nGene.org for Ethical Web Crawling

  1. Respect Website Policies:
    • Action: Before initiating any crawling activity, Project nGene.org reviews and complies with the target website’s Terms of Service. This ensures that web crawling activities respect the access permissions and restrictions set by website administrators, maintaining a respectful and non-intrusive presence.
  2. Limit Request Rates:
    • Action: The software incorporates rate-limiting mechanisms, spacing out requests to mimic human browsing patterns and prevent server strain.
  3. Identify Your Crawler:
    • Action: Utilizing a clear and descriptive User-Agent string that includes contact information, Project nGene.org ensures transparency in its web crawling operations.
  4. Avoid Collecting Sensitive Data:
    • Action: The project focuses solely on publicly available and non-sensitive data, deliberately avoiding the collection of personal, confidential, or restricted information.
  5. Respect Data Privacy:
    • Action: Project nGene.org adheres to data protection regulations, implementing robust data security measures to safeguard collected information.
  6. Provide Opt-Out Mechanisms:
    • Action: While not always directly controllable, Project nGene.org responds promptly to requests from website owners to cease crawling, respecting their preferences and maintaining ethical standards.
  7. Use Data Responsibly:
    • Action: The collected data is utilized solely for academic research and development within the field of hemodynamics, avoiding misuse or unauthorized distribution.
  8. Stay Informed:
    • Action: The project team remains updated on evolving laws, regulations, and best practices related to web crawling and data collection, ensuring ongoing compliance and ethical conduct.
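Practices 2 and 3 above can be sketched together as a small request helper. The crawl delay, crawler name, project URL, and contact address below are illustrative placeholders, not the project's actual values:

```javascript
// Sketch of a "polite" request helper combining rate limiting with a
// transparent User-Agent. Delay, name, URL, and contact are assumptions.
const CRAWL_DELAY_MS = 1500; // assumed minimum gap between requests

function crawlerHeaders() {
  return {
    // Descriptive User-Agent: crawler name, version, and contact information
    'User-Agent': 'nGeneCrawler/1.0 (+https://nGene.org; webmaster@example.org)',
    'Accept': 'text/html,application/xhtml+xml',
  };
}

let lastRequestAt = 0;

// How long the next request must wait to honor the crawl delay.
function requiredDelay(now) {
  return Math.max(0, CRAWL_DELAY_MS - (now - lastRequestAt));
}

async function politeFetch(url) {
  const wait = requiredDelay(Date.now());
  if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
  lastRequestAt = Date.now();
  return fetch(url, { headers: crawlerHeaders() });
}
```

Spacing requests this way also gives website operators an obvious, contactable identity in their access logs, which supports the transparency practice above.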

Comparative Analysis of Three Web Crawler Prototypes

Project nGene.org has developed three distinct web crawler prototypes, each utilizing different programming languages and methodologies. These prototypes serve as foundational tools for automated data collection and analysis, essential for advancing research objectives. This analysis delineates the programming characteristics, advantages, and limitations of each version, providing a comprehensive understanding of their operational dynamics. It is important to note that these implementations are in the prototype stage, primarily designed for testing and evaluation purposes.



(A) JavaScript-Based Web Crawler

Programming Language and Environment

Architecture and Design

Advantages

Limitations

Potential Mitigation Strategies



(B) Python-Based Web Crawler Utilizing Requests and BeautifulSoup

Programming Language and Environment

Architecture and Design

Advantages

Limitations

Potential Mitigation Strategies



(C) Selenium-Based Web Crawler Utilizing Browser Automation

Programming Language and Environment

Architecture and Design

Advantages

Limitations

Potential Mitigation Strategies


Overview of the JavaScript-Based Web Crawler and Image Downloader Prototype

Project nGene.org has developed a prototype of a JavaScript-based web crawler and image downloader intended to automate the collection and analysis of web-based biomedical data. This client-side crawler operates within a web browser, allowing the user to enter a target website URL, specify the depth of recursion, select the HTML tags to search for, and choose whether to limit crawling to the same domain. The following outlines the functionality of this prototype, the challenges encountered (particularly regarding Cross-Origin Resource Sharing (CORS) policies), and its inherent limitations, along with potential strategies to overcome these obstacles.









(A) Functionality and Features



(B) Issues and Limitations

While the JavaScript-based crawler offers a convenient and accessible means of collecting data directly from the browser, it faces several significant challenges:

1. Cross-Origin Resource Sharing (CORS) Policy

CORS is a security feature implemented by web browsers to restrict web pages from making requests to a different domain than the one that served the web page. This ensures that malicious websites cannot access sensitive data from other sites without permission.
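The effect of this policy can be illustrated with a simplified model of the browser's check: a cross-origin response is exposed to the page only when the server opts in via the Access-Control-Allow-Origin response header. (Real CORS also involves preflight requests, allowed methods and headers, and credentials; the functions below are illustrative, not a full implementation.)

```javascript
// Simplified model of the browser's CORS check (illustrative only): a
// cross-origin response is exposed to the page only if the server's
// Access-Control-Allow-Origin header matches the page's origin or is '*'.
function isCrossOrigin(pageOrigin, requestUrl) {
  return new URL(requestUrl).origin !== pageOrigin;
}

function corsAllows(pageOrigin, allowOriginHeader) {
  return allowOriginHeader === '*' || allowOriginHeader === pageOrigin;
}

// Example: a page on https://ngene.org fetching from another domain
const pageOrigin = 'https://ngene.org';
if (isCrossOrigin(pageOrigin, 'https://example.com/data.json')) {
  // The request may be sent, but the response is readable only if the
  // server opted in:
  console.log(corsAllows(pageOrigin, '*')); // true: wildcard opt-in
  console.log(corsAllows(pageOrigin, 'https://other.org')); // false: blocked
}
```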

2. Same-Origin Policy Constraints

The same-origin policy is a security measure that allows scripts running on a web page to interact only with resources from the same origin (i.e., same domain, protocol, and port). This restricts the crawler from accessing and processing content from external websites unless they are within the same domain or have been configured to allow such interactions.

3. Performance and Scalability

4. Handling Dynamic and JavaScript-Heavy Websites



(C) CORS Policy Issues and Their Implications

The enforcement of CORS policies presents a significant barrier to the crawler's effectiveness:



(D) Limitations of the Current Implementation



(E) Circumventing CORS and Overcoming Limitations

While CORS policies and other limitations present significant challenges, several strategies can mitigate these issues:

  1. Using a CORS Proxy:
    • Definition: A CORS proxy acts as an intermediary between the crawler and the target website, adding the necessary CORS headers to responses.
    • Benefits: By routing requests through a CORS proxy, the crawler can bypass browser-enforced CORS restrictions, enabling access to external resources.
    • Considerations: Public CORS proxies may have usage limitations or introduce latency. For large-scale or frequent crawling, setting up a dedicated CORS proxy server is advisable.
  2. Server-Side Crawling:
    • Approach: Shifting the crawling process to a server-side environment (e.g., using Node.js) bypasses browser-imposed CORS restrictions.
    • Advantages: Server-side crawlers are not subject to CORS policies, can handle larger-scale data collection, and can execute JavaScript if needed (using headless browsers like Puppeteer).
    • Implementation: Developing a backend service that performs crawling tasks and communicates results to the client-side application.
  3. Leveraging Browser Extensions:
    • Strategy: Creating a browser extension with elevated permissions can allow the crawler to access cross-origin resources by modifying request headers.
    • Limitations: Developing and distributing browser extensions requires additional effort and may introduce security risks if not properly managed.
  4. Using Headless Browsers:
    • Tools: Headless browsers like Puppeteer or Selenium can execute JavaScript, interact with dynamic content, and bypass some CORS restrictions by controlling browser behavior programmatically.
    • Benefits: Enhanced capability to handle complex websites and dynamic content, providing more comprehensive data collection.
    • Drawbacks: Requires running the crawler outside the standard browser environment, involving more complex setup and resource management.
  5. Implementing Rate Limiting and Throttling:
    • Purpose: To mitigate performance issues and reduce the risk of overloading target websites, strict rate limiting and request throttling can be implemented.
    • Method: Introduce delays between requests and limit the number of concurrent fetch operations.
    • Example Implementation:
      async function crawl(url, depth, tag, sameDomainOnly, visited = new Set(), failed = new Set(), baseDomain = null) {
         if (stopCrawling || depth < 0 || visited.has(url) || failed.has(url)) return;
      
         visited.add(url);
      
         // ... existing code ...
      
         // Introduce a delay between requests
         await new Promise(resolve => setTimeout(resolve, 1000)); // 1-second delay
      
         // ... continue crawling ...
      }
  6. Enhancing Error Handling:
    • Robust Error Logging: Improve error handling to gracefully manage CORS-related failures and provide meaningful feedback to users.
    • Retry Mechanisms: Implement retry logic for transient errors, possibly using exponential backoff strategies to manage repeated request failures.
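The retry logic of item 6 can be sketched as follows; the base delay, cap, and injected fetchFn parameter are illustrative assumptions rather than the prototype's actual values:

```javascript
// Sketch of retry logic with exponential backoff (item 6 above). The base
// delay, cap, and injected fetchFn are illustrative assumptions.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  // Delay doubles with each attempt: 500, 1000, 2000, ... capped at maxMs
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

async function fetchWithRetry(url, fetchFn, maxRetries = 4) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fetchFn(url); // success: return the response
    } catch (err) {
      if (attempt === maxRetries) throw err; // give up after the last retry
      // Wait progressively longer before each retry
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```

Capping the delay prevents a long outage from producing unbounded waits, while the doubling schedule backs off quickly from a struggling server.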


Conclusion

The JavaScript-based web crawler and image downloader prototype integrated into Project nGene.org offers a user-friendly interface for automated data collection directly within the browser. However, it faces significant challenges related to browser security policies, particularly CORS, as well as inherent limitations in handling dynamic content and maintaining performance. Strategies such as using CORS proxies, shifting to server-side crawling, leveraging headless browsers, and implementing robust rate limiting can effectively mitigate these limitations. These enhancements will enable the prototype to perform more comprehensive and efficient data collection, thereby supporting the mission to advance hemodynamic research through accurate and extensive biomedical data aggregation.


Crawler JavaScript Source Code with Detailed Comments


Firefox Extension


Developing a Firefox Extension for Automatic Media Downloading

The ability to automatically download images and videos from webpages can enhance productivity and user experience. Implementing this functionality in Firefox can be approached in two primary ways: modifying Firefox's source code or developing a browser extension. This document provides an integrated overview of these methods, focusing on the creation of a Firefox extension due to its practicality and ease of maintenance.

Approaches to Implementing Automatic Media Downloading in Firefox

Modifying Firefox Source Code

Modifying the Firefox source code involves directly editing the browser's internal components to include the desired functionality. While this approach offers deep integration and control, it presents significant challenges:

Developing a Firefox Extension

Creating a Firefox extension, specifically a WebExtension, is a more practical solution. Extensions are easier to develop, maintain, and distribute. They operate within the browser's existing framework, providing the desired functionality without altering the core code.

Recommended Approach: Developing a Firefox Extension

Programming Languages Used

Firefox extensions utilize standard web technologies, making development accessible:

Overview of Extension Development

  1. Setting Up the Development Environment:
    • Install Firefox Developer Edition for advanced debugging features.
    • Use the about:debugging page (about:debugging#/runtime/this-firefox) to inspect and temporarily load extensions during development.
  2. Creating Essential Files:
    • manifest.json: Defines metadata, permissions, and scripts.
    • Background Script: Handles media detection and download initiation.
    • Content Script: Interacts with webpages to collect media URLs.
  3. Implementing Functionality:
    • Media Detection: The content script scans webpages for visible images and videos, collecting their source URLs.
    • Automatic Downloading: The background script receives media URLs and uses the Downloads API to save files to the default download directory.
  4. Testing and Deployment:
    • Load the extension temporarily in Firefox for testing.
    • Optionally, package and publish the extension on Mozilla's Add-ons site for wider distribution.
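As a sketch of the "Automatic Downloading" step, the background script can derive a relative save path for each media URL and hand it to the Downloads API (browser.downloads.download is the real WebExtension call; the filenameFor helper below is a hypothetical illustration, and the "firefox" subfolder matches the behavior the extension's own description documents):

```javascript
// Sketch of the background script's download step. filenameFor is a
// hypothetical helper; the 'firefox' subfolder matches the extension's
// documented behavior.
function filenameFor(url) {
  const base = new URL(url).pathname.split('/').pop() || 'unnamed';
  return `firefox/${base}`; // resolved under the default download directory
}

// In the actual background script, the derived path is handed to the
// WebExtension Downloads API:
//   browser.downloads.download({ url, filename: filenameFor(url) });
```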

Functionality of the Extension

How It Works

The extension operates by:

Handling Logged-In Websites

The extension can download content from websites requiring authentication because:

Possible limitations include:

Considerations Regarding Website Detection

Potential Detection Methods

Websites might detect automated downloading through:

Best Practices to Minimize Detection

Limitations and Legal Considerations

Platforms with DRM Protections

Websites like YouTube, Netflix, and other streaming services employ DRM technologies that prevent the downloading of their content. The extension:

Ethical and Legal Concerns

Users should be mindful of:

Written on November 29th, 2024


Automated Downloading in Firefox Extensions: Minimizing Detection and Ensuring Ethical Use (Written November 30, 2024)

Automated downloading and web scraping can inadvertently trigger detection mechanisms on websites, potentially resulting in blocks or other restrictions. Implementing best practices helps minimize the risk of detection while ensuring responsible and ethical use of automated tools within Firefox extensions. The strategies outlined below provide guidance on emulating human-like behavior, respecting website policies, and preventing server overload.

1. Throttling Downloads

Introducing delays between download requests is essential for mimicking human behavior. Randomized delays make automated activities less distinguishable from those of regular users.

function startDownloadWithDelay(item, delay) {
    setTimeout(() => {
        startDownload(item);
    }, delay);
}

// Use a random delay between 1 to 3 seconds
const randomDelay = Math.random() * 2000 + 1000; // 1000 to 3000 ms
startDownloadWithDelay(item, randomDelay);

In this example, startDownloadWithDelay introduces a delay before initiating the download. The delay is randomized between 1 and 3 seconds to avoid the regular timing patterns that automated detection systems look for.


2. Limiting Download Scope

Focusing on downloading only visible and relevant media reduces the volume of requests and aligns with typical user behavior.

function isElementInViewport(el) {
    const rect = el.getBoundingClientRect();
    return (
        rect.top >= 0 &&
        rect.left >= 0 &&
        rect.bottom <= (window.innerHeight || document.documentElement.clientHeight) &&
        rect.right <= (window.innerWidth || document.documentElement.clientWidth)
    );
}

The isElementInViewport function determines if a media element is within the visible area of the webpage. By downloading only these elements, the automation mimics typical user interaction with the page.


3. Respecting Website Policies

Adhering to a website's policies and guidelines is essential for ethical automation practices. The robots.txt file provides directives on how automated agents should interact with the site.

Accessing robots.txt

  1. Locate the File: Navigate to https://example.com/robots.txt, replacing example.com with the target domain.
  2. Parse the Content: Analyze the file to identify any restrictions applicable to automated downloading.

Example robots.txt Content

User-agent: *
Disallow: /private/

In this example, all user agents are instructed not to access the /private/ directory. Automated tools should respect this directive to comply with the website's policies.
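A minimal check against such directives might look like the following sketch. (Real robots.txt parsing also handles Allow rules, wildcards, and per-agent groups; here every Disallow rule is applied to all agents.)

```javascript
// Minimal robots.txt check (a sketch: real parsers also handle Allow rules,
// wildcards, and per-agent groups; here all Disallow rules apply to every agent).
function disallowedPaths(robotsTxt) {
  return robotsTxt
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.slice('disallow:'.length).trim())
    .filter((path) => path.length > 0);
}

function isAllowed(robotsTxt, path) {
  return !disallowedPaths(robotsTxt).some((rule) => path.startsWith(rule));
}

const robots = 'User-agent: *\nDisallow: /private/';
console.log(isAllowed(robots, '/private/data.html')); // false
console.log(isAllowed(robots, '/public/page.html')); // true
```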


4. Avoiding Header Manipulation

Maintaining standard request headers helps prevent anomalies that might trigger detection systems. Custom headers or unusual values can raise red flags.

By adhering to standard header configurations, automated requests appear more like those from regular users, reducing the likelihood of detection.


5. Preventing Server Overload

Excessive simultaneous downloads can strain server resources and negatively impact website performance. Limiting concurrency ensures responsible use of resources.

let activeDownloads = 0;
const maxConcurrentDownloads = 5;
const downloadQueue = [];

function processQueue() {
    if (activeDownloads < maxConcurrentDownloads && downloadQueue.length > 0) {
        const item = downloadQueue.shift();
        activeDownloads++;
        startDownload(item, () => {
            activeDownloads--;
            processQueue();
        });
    }
}

// Add items to the queue and start processing
downloadQueue.push(...itemsToDownload);
processQueue();

In this code, processQueue manages the download queue by ensuring that no more than five downloads occur at the same time. The startDownload function includes the logic for downloading the item and invokes a callback upon completion.

Written on November 30th, 2024


nGeneAutomaticDownloader: Firefox Extension Documentation (Written November 30, 2024 V1.5)

This document provides a comprehensive explanation of the five scripts used in the nGeneAutomaticDownloader Firefox extension. Each section includes the full script with detailed comments and an explanation of how functions and features are implemented to facilitate easier understanding and maintenance.


Section 1: manifest.json

{
  "manifest_version": 2,  // Specifies the version of the manifest file format
  "name": "nGeneAutomaticDownloader",  // The name of the extension
  "version": "1.5",  // The version of the extension
  "description": "Automatically downloads all images and videos from webpages to a 'firefox' folder in your default download directory.",  // A brief description
  "permissions": [
    "downloads",  // Allows use of the downloads API to download files
    "tabs",  // Grants access to browser tabs
    "<all_urls>",  // Allows access to all URLs
    "storage",  // Permits storage and retrieval of data using chrome.storage API
    "webRequest",  // Enables observation and analysis of web requests
    "webRequestBlocking"  // Allows modification or blocking of web requests
  ],
  "background": {
    "scripts": ["background.js"]  // Specifies the background script
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],  // The content script will be injected into all pages
      "exclude_matches": ["about:*", "resource://*/*"],  // Excludes internal browser pages
      "js": ["content.js"],  // The content script file
      "run_at": "document_idle"  // Injects the script after the page has loaded
    }
  ],
  "browser_action": {
    "default_title": "nGeneAutomaticDownloader",  // Tooltip text for the browser action icon
    "default_popup": "options.html",  // HTML file displayed when the icon is clicked
    "default_icon": {
      "48": "icons/download-icon.png"  // Icon for the browser action
    }
  },
  "options_ui": {
    "page": "options.html",  // Options page for the extension
    "open_in_tab": false  // Opens the options page as a popup
  },
  "icons": {
    "48": "icons/download-icon.png"  // The extension's icon
  }
}

The manifest.json file is the configuration file for the Firefox extension. It defines essential metadata and specifies the extension's behavior.


Section 2: content.js

(function () {
  // Set to keep track of processed media URLs to prevent duplicates
  const processedMediaUrls = new Set();

  // Main function to process media elements starting from a root node
  function processMediaElements(rootNode) {
    const mediaUrls = []; // Array to collect media URLs found

    // If the root node is not an element or the document itself, exit
    if (rootNode.nodeType !== Node.ELEMENT_NODE && rootNode !== document) {
      return;
    }

    // Nodes to process; start with the root node
    const nodes = rootNode === document ? [document] : [rootNode];

    // Iterate over each node to collect media URLs
    nodes.forEach((node) => {
      // Collect images from <img> tags
      node.querySelectorAll('img').forEach((img) => {
        collectImageFromElement(img, mediaUrls);
      });

      // Collect images from <picture> elements
      node.querySelectorAll('picture source').forEach((source) => {
        collectSrcsetUrls(source, mediaUrls);
      });

      // Collect videos and their source elements
      node.querySelectorAll('video, source').forEach((element) => {
        collectVideoFromElement(element, mediaUrls);
      });

      // Collect images from <object> and <embed> tags
      node.querySelectorAll('object, embed').forEach((element) => {
        collectObjectEmbedMedia(element, mediaUrls);
      });

      // Collect background images from CSS stylesheets
      collectBackgroundImages(mediaUrls);

      // Collect images from inline styles
      collectInlineStyles(mediaUrls);

      // Collect images from <canvas> elements
      node.querySelectorAll('canvas').forEach((canvas) => {
        collectCanvasImage(canvas);
      });

      // Collect images from pseudo-elements (::before and ::after)
      collectPseudoElementImages(mediaUrls);
    });

    // Process the collected media URLs
    processMediaUrls(mediaUrls);
  }

  // Collects image URLs from <img> elements
  function collectImageFromElement(img, mediaUrls) {
    let url = img.src || img.currentSrc; // Get the image source URL
    if (!url) {
      // Check for lazy-loaded images
      url = img.getAttribute('data-src') || img.getAttribute('data-lazy-src');
    }
    if (url) {
      mediaUrls.push(url); // Add the URL to the list
    }
    // Handle srcset attribute for responsive images
    const srcset = img.getAttribute('srcset');
    if (srcset) {
      const srcsetUrls = srcset
        .split(',')
        .map((entry) => entry.trim().split(' ')[0]);
      srcsetUrls.forEach((srcsetUrl) => {
        if (srcsetUrl) {
          mediaUrls.push(srcsetUrl);
        }
      });
    }
  }

  // Collects image URLs from <source> elements in <picture> tags
  function collectSrcsetUrls(element, mediaUrls) {
    const srcset = element.getAttribute('srcset');
    if (srcset) {
      const srcsetUrls = srcset
        .split(',')
        .map((entry) => entry.trim().split(' ')[0]);
      srcsetUrls.forEach((srcsetUrl) => {
        if (srcsetUrl) {
          mediaUrls.push(srcsetUrl);
        }
      });
    }
    // Check for 'src' attribute
    const src = element.getAttribute('src');
    if (src) {
      mediaUrls.push(src);
    }
  }

  // Extracts images from <canvas> elements
  function collectCanvasImage(canvas) {
    try {
      // Convert the canvas content to a data URL
      const dataURL = canvas.toDataURL();
      if (dataURL && !processedMediaUrls.has(dataURL)) {
        processedMediaUrls.add(dataURL); // Mark as processed
        // Send a message to download the data URL
        chrome.runtime.sendMessage(
          { type: 'downloadDataUrl', dataUrl: dataURL },
          function (response) {
            if (chrome.runtime.lastError) {
              console.error(
                `Error sending canvas image: ${chrome.runtime.lastError.message}`
              );
            }
          }
        );
      }
    } catch (e) {
      console.error('Failed to extract image from canvas:', e);
    }
  }

  // Collects video URLs from <video> and <source> elements
  function collectVideoFromElement(element, mediaUrls) {
    if (element.tagName.toLowerCase() === 'video') {
      let url = element.src || element.currentSrc; // Get the video source URL
      if (!url) {
        url = element.getAttribute('data-src');
      }
      if (url) {
        mediaUrls.push(url);
      }
      // Process <source> elements within the <video>
      element.querySelectorAll('source').forEach((sourceElement) => {
        const sourceUrl =
          sourceElement.src ||
          sourceElement.getAttribute('src') ||
          sourceElement.getAttribute('data-src');
        if (sourceUrl) {
          mediaUrls.push(sourceUrl);
        }
      });
      // Check for 'poster' attribute
      const posterUrl = element.getAttribute('poster');
      if (posterUrl) {
        mediaUrls.push(posterUrl);
      }
    } else if (element.tagName.toLowerCase() === 'source') {
      // For <source> elements outside of <video>
      const sourceUrl =
        element.src ||
        element.getAttribute('src') ||
        element.getAttribute('data-src');
      if (sourceUrl) {
        mediaUrls.push(sourceUrl);
      }
    }
  }

  // Collects media URLs from <object> and <embed> elements
  function collectObjectEmbedMedia(element, mediaUrls) {
    const url = element.data || element.getAttribute('data');
    if (url) {
      mediaUrls.push(url);
    }
  }

  // Collects background images from CSS stylesheets
  function collectBackgroundImages(mediaUrls) {
    for (const sheet of document.styleSheets) {
      let rules;
      try {
        rules = sheet.cssRules; // Get CSS rules
      } catch (e) {
        // Skip cross-origin stylesheets
        continue;
      }

      if (!rules) continue;

      for (const rule of rules) {
        if (rule.type === CSSRule.STYLE_RULE) {
          const style = rule.style;
          const bgImage =
            style.getPropertyValue('background-image') ||
            style.getPropertyValue('background');
          extractUrlsFromStyle(bgImage, mediaUrls);
        } else if (rule.type === CSSRule.MEDIA_RULE) {
          // Handle @media rules
          for (const mediaRule of rule.cssRules) {
            if (mediaRule.type === CSSRule.STYLE_RULE) {
              const style = mediaRule.style;
              const bgImage =
                style.getPropertyValue('background-image') ||
                style.getPropertyValue('background');
              extractUrlsFromStyle(bgImage, mediaUrls);
            }
          }
        }
      }
    }
  }

  // Collects background images from inline styles
  function collectInlineStyles(mediaUrls) {
    document.querySelectorAll('*[style]').forEach((element) => {
      const style = element.getAttribute('style');
      extractUrlsFromStyle(style, mediaUrls);
    });
  }

  // Collects images from pseudo-elements (::before and ::after)
  function collectPseudoElementImages(mediaUrls) {
    document.querySelectorAll('*').forEach((element) => {
      ['::before', '::after'].forEach((pseudo) => {
        const style = getComputedStyle(element, pseudo);
        const bgImage =
          style.getPropertyValue('background-image') ||
          style.getPropertyValue('background');
        extractUrlsFromStyle(bgImage, mediaUrls);
      });
    });
  }

  // Extracts URLs from CSS style properties
  function extractUrlsFromStyle(styleValue, mediaUrls) {
    if (styleValue && styleValue !== 'none') {
      // Match URLs in the style value
      const urls = styleValue.match(/url\(["']?([^"')]+)["']?\)/g);
      if (urls) {
        urls.forEach((urlString) => {
          const url = urlString.match(/url\(["']?([^"')]+)["']?\)/)[1];
          if (url) {
            // Resolve relative URLs to absolute URLs
            const absoluteUrl = new URL(url, location.href).href;
            mediaUrls.push(absoluteUrl);
          }
        });
      }
    }
  }

  // Processes the collected media URLs
  function processMediaUrls(mediaUrls) {
    // Remove duplicates and already processed URLs
    const uniqueUrls = Array.from(new Set(mediaUrls));
    uniqueUrls.forEach((url) => {
      const cleanUrl = url.split('#')[0]; // Remove fragment identifiers

      if (processedMediaUrls.has(cleanUrl)) {
        return; // Skip already processed URLs
      }

      processedMediaUrls.add(cleanUrl); // Mark as processed

      // Handle data URLs
      if (url.startsWith('data:')) {
        // Send a message to download the data URL
        chrome.runtime.sendMessage(
          { type: 'downloadDataUrl', dataUrl: url },
          function (response) {
            if (chrome.runtime.lastError) {
              console.error(
                `Error sending data URL: ${chrome.runtime.lastError.message}`
              );
            }
          }
        );
        return;
      }

      let filename;
      try {
        const urlObj = new URL(url, location.href); // Create a URL object
        filename = urlObj.pathname.split('/').pop(); // Extract the filename
        if (!filename || filename.length === 0) {
          filename = 'unnamed'; // Default filename
        }
        // Try to get the file extension from the filename
        let extension = filename.includes('.') ? filename.split('.').pop() : '';
        if (!extension) {
          // The filename has no extension; try to guess one from a MIME type
          // hint in the query string and append it
          const mimeType = urlObj.searchParams.get('type') || '';
          if (mimeType) {
            extension = mimeType.split('/').pop();
          }
          if (extension) {
            filename += '.' + extension;
          }
        }
      } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return;
      }

      // Send a message to download the file
      chrome.runtime.sendMessage(
        { type: 'download', url: url, filename: filename },
        function (response) {
          if (chrome.runtime.lastError) {
            console.error(
              `Error sending message for ${url}: ${chrome.runtime.lastError.message}`
            );
          }
        }
      );
    });
  }

  // Observes changes in the DOM to detect new media elements
  const observer = new MutationObserver((mutations) => {
    mutations.forEach((mutation) => {
      if (mutation.type === 'childList') {
        // If nodes are added to the DOM
        mutation.addedNodes.forEach((node) => {
          if (node.nodeType === Node.ELEMENT_NODE) {
            processMediaElements(node); // Process the new node
            // Also process media elements within this node
            node
              .querySelectorAll(
                'img, video, source, picture source, object, embed, canvas'
              )
              .forEach((element) => {
                processMediaElements(element);
              });
          }
        });
      } else if (mutation.type === 'attributes') {
        // If attributes of an element have changed
        if (mutation.target && mutation.target.nodeType === Node.ELEMENT_NODE) {
          // Check if the changed attribute is relevant
          const relevantAttributes = [
            'src',
            'srcset',
            'style',
            'data-src',
            'data-lazy-src',
            'poster',
            'data',
            'href',
          ];
          if (relevantAttributes.includes(mutation.attributeName)) {
            processMediaElements(mutation.target); // Process the element
          }
        }
      }
    });
  });

  // Start observing the document for changes
  observer.observe(document, {
    childList: true, // Observe when nodes are added or removed
    subtree: true, // Observe all descendant nodes
    attributes: true, // Observe attribute changes
    attributeFilter: [
      'src',
      'srcset',
      'style',
      'data-src',
      'data-lazy-src',
      'poster',
      'data',
      'href',
    ], // Attributes to observe
  });

  // Listen for user interactions to trigger processing
  ['click', 'scroll', 'mousemove', 'touchstart', 'touchmove'].forEach(
    (event) => {
      document.addEventListener(event, () => {
        processMediaElements(document);
      });
    }
  );

  // Initial processing when the window loads
  window.addEventListener('load', () => {
    processMediaElements(document);
  });
})();

The content.js content script runs in the context of each web page. Its primary purpose is to identify and collect all media elements (images, videos, etc.) on the page and to send messages to the background script requesting that those media files be downloaded.
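The filename-derivation step in processMediaUrls can be isolated as a pure function for testing. This is a sketch following the same rules as the script above; the function name deriveFilename and the example base URL are ours, with the base standing in for location.href:

```javascript
// Hypothetical standalone version of the filename-derivation step in
// processMediaUrls: take a (possibly relative) media URL and return a
// filename guess for the download request.
function deriveFilename(url, base = 'https://example.com/') {
  const urlObj = new URL(url, base); // Resolves relative URLs against the page
  let filename = urlObj.pathname.split('/').pop() || 'unnamed';
  if (!filename.includes('.')) {
    // Guess an extension from a `type` query parameter (e.g. ?type=image/png)
    const mimeType = urlObj.searchParams.get('type') || '';
    const extension = mimeType ? mimeType.split('/').pop() : '';
    if (extension) {
      filename += '.' + extension;
    }
  }
  return filename;
}
```

For example, a URL such as /asset?type=image/png yields asset.png, while a URL whose path ends in / falls back to unnamed.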


Section 3: background.js

// Variables to manage downloads and settings
let downloadQueue = []; // Queue for download requests
let activeDownloads = 0; // Number of active downloads
let maxConcurrentDownloads = 10; // Default maximum concurrent downloads
let keepTrack = true; // Whether to keep track of downloaded filenames
let minFileSize = 50 * 1024; // Minimum file size in bytes (default 50 KB)
let downloadedFilenames = new Set(); // Set to store filenames of downloaded files
let downloadedUrls = new Set(); // Set to store URLs of downloaded files

// Load initial settings from chrome.storage.local
chrome.storage.local.get(
  [
    'threads',
    'keepTrack',
    'minFileSize',
    'downloadedFilenames',
    'downloadedUrls',
  ],
  (result) => {
    // Update variables with saved settings or use defaults
    maxConcurrentDownloads = result.threads || 10;
    keepTrack = result.keepTrack !== false; // Default to true if undefined
    minFileSize = (result.minFileSize || 50) * 1024; // Convert KB to bytes
    if (result.downloadedFilenames && Array.isArray(result.downloadedFilenames)) {
      downloadedFilenames = new Set(result.downloadedFilenames); // Initialize set with saved filenames
    }
    if (result.downloadedUrls && Array.isArray(result.downloadedUrls)) {
      downloadedUrls = new Set(result.downloadedUrls); // Initialize set with saved URLs
    }
  }
);

// Listener for messages from other parts of the extension
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === 'updateSettings') {
    // Update settings when they are changed in options
    chrome.storage.local.get(
      ['threads', 'keepTrack', 'minFileSize'],
      (result) => {
        maxConcurrentDownloads = result.threads || 10;
        keepTrack = result.keepTrack !== false;
        minFileSize = (result.minFileSize || 50) * 1024;
        console.log(
          `Updated settings: maxConcurrentDownloads = ${maxConcurrentDownloads}, keepTrack = ${keepTrack}, minFileSize = ${minFileSize} bytes`
        );
      }
    );
  } else if (message.type === 'resetQueue') {
    // Reset the download queue
    downloadQueue = [];
    sendResponse({ status: 'success' });
  } else if (message.type === 'clearHistory') {
    // Clear the set of downloaded filenames and URLs
    downloadedFilenames.clear();
    downloadedUrls.clear();
    chrome.storage.local.set(
      { downloadedFilenames: [], downloadedUrls: [] },
      () => {
        sendResponse({ status: 'success' });
      }
    );
    return true; // Keep the message channel open for sendResponse
  } else if (message.type === 'download') {
    // Handle download request from content script
    const url = message.url;
    const filename = message.filename;

    // Check if the file has already been downloaded
    if (keepTrack && downloadedUrls.has(url)) {
      console.log(
        `Skipping download for URL ${url} as it has already been downloaded.`
      );
      sendResponse({ status: 'skipped' });
      return;
    }

    // Add the download request to the queue
    downloadQueue.push({ url: url, filename: filename });

    // Start processing the queue
    processQueue();

    sendResponse({ status: 'queued' });
  } else if (message.type === 'downloadDataUrl') {
    // Handle download request for data URL
    const dataUrl = message.dataUrl;
    const filename = `image_${Date.now()}.png`; // Generate a unique filename

    // Convert data URL to Blob
    fetch(dataUrl)
      .then((res) => res.blob())
      .then((blob) => {
        const url = URL.createObjectURL(blob);
        downloadQueue.push({ url: url, filename: filename, revokeUrl: true });
        processQueue();
        sendResponse({ status: 'queued' });
      })
      .catch((error) => {
        console.error(`Failed to download data URL: ${error}`);
        sendResponse({ status: 'error', error: error.toString() });
      });
    return true; // Keep the message channel open for sendResponse
  }
});

// Function to process the download queue
function processQueue() {
  // Continue processing while there are slots for active downloads
  while (activeDownloads < maxConcurrentDownloads && downloadQueue.length > 0) {
    const item = downloadQueue.shift(); // Get the next item from the queue
    startDownload(item); // Start the download
  }
}

// Function to start a download
function startDownload(item) {
  activeDownloads++; // Increment the number of active downloads

  const filename = item.filename || item.url.split('/').pop();
  const fullFilename = `firefox/${sanitizeFilename(filename)}`; // Prepend 'firefox/' to create a subdirectory

  chrome.downloads.download(
    {
      url: item.url,
      filename: fullFilename,
      saveAs: false, // Do not prompt the user for save location
      conflictAction: 'overwrite', // Overwrite existing files
    },
    (downloadId) => {
      if (chrome.runtime.lastError) {
        // Handle errors during download initiation
        console.error(
          `Download failed for ${item.url}: ${chrome.runtime.lastError.message}`
        );
        activeDownloads--; // Decrement active downloads
        processQueue(); // Try the next item in the queue
      } else {
        console.log(
          `Download started: ID = ${downloadId}, filename = ${fullFilename}`
        );

        // Listener for changes in the download state
        function onChanged(delta) {
          if (
            delta.id === downloadId &&
            delta.state &&
            delta.state.current === 'complete'
          ) {
            // Download completed successfully
            chrome.downloads.search({ id: downloadId }, function (items) {
              if (items && items.length > 0) {
                const downloadItem = items[0];
                const fileSize =
                  downloadItem.fileSize || downloadItem.totalBytes; // Get the file size
                if (fileSize < minFileSize) {
                  // File is smaller than the minimum size
                  chrome.downloads.removeFile(downloadId, function () {
                    if (chrome.runtime.lastError) {
                      console.error(
                        `Failed to remove file: ${chrome.runtime.lastError.message}`
                      );
                    } else {
                      console.log(
                        `Removed file ${fullFilename} (size ${fileSize} bytes) because it is smaller than the minimum size (${minFileSize} bytes).`
                      );
                    }
                    // Remove the download from history
                    chrome.downloads.erase({ id: downloadId });
                  });
                } else {
                  // File meets the size requirement
                  if (keepTrack) {
                    // Add the filename and URL to the sets
                    downloadedFilenames.add(filename);
                    downloadedUrls.add(item.url);
                    // Update the stored sets
                    chrome.storage.local.set({
                      downloadedFilenames: Array.from(downloadedFilenames),
                      downloadedUrls: Array.from(downloadedUrls),
                    });
                  }
                }
              }
            });

            // Cleanup after download completion
            chrome.downloads.onChanged.removeListener(onChanged);
            activeDownloads--; // Decrement active downloads
            processQueue(); // Process the next item

            // Revoke object URL if necessary
            if (item.revokeUrl) {
              URL.revokeObjectURL(item.url);
            }
          } else if (
            delta.id === downloadId &&
            delta.state &&
            delta.state.current === 'interrupted'
          ) {
            // Download was interrupted
            console.error(`Download interrupted for ${item.url}`);
            chrome.downloads.onChanged.removeListener(onChanged);
            activeDownloads--; // Decrement active downloads
            processQueue(); // Process the next item

            // Revoke object URL if necessary
            if (item.revokeUrl) {
              URL.revokeObjectURL(item.url);
            }
          }
        }

        // Add the listener to monitor download changes
        chrome.downloads.onChanged.addListener(onChanged);
      }
    }
  );
}

// Function to sanitize filenames
function sanitizeFilename(filename) {
  return filename.replace(/[\\/:*?"<>|]/g, '_');
}

// Use the webRequest API to monitor network requests
chrome.webRequest.onCompleted.addListener(
  (details) => {
    // Check if the request is for an image or video
    if (details.type === 'image' || details.type === 'media') {
      const url = details.url;

      // Extract filename from URL
      let filename;
      try {
        const urlObj = new URL(url);
        filename = urlObj.pathname.split('/').pop();
        if (!filename || filename.length === 0) {
          filename = 'unnamed';
        }
      } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return;
      }

      filename = filename.split('?')[0];

      // Sanitize filename
      filename = sanitizeFilename(filename);

      // Check if the file has already been downloaded
      if (keepTrack && downloadedUrls.has(url)) {
        console.log(
          `Skipping download for URL ${url} as it has already been downloaded.`
        );
        return;
      }

      // Add the download request to the queue
      downloadQueue.push({ url: url, filename: filename });

      // Start processing the queue
      processQueue();
    }
  },
  { urls: ['<all_urls>'] },
  []
);

The background.js script runs in the background context of the extension. It manages the downloading of media files, handles settings, and communicates with the content script and options page.
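The queue discipline that background.js builds from downloadQueue, activeDownloads, and the repeated processQueue() calls can be modeled as a small promise-based helper. This is a simplified sketch: makeQueue is our name, and real chrome.downloads calls are replaced by arbitrary async tasks:

```javascript
// Simplified, promise-based model of the download queue in background.js:
// at most maxConcurrent tasks run at once; each completion pulls the next
// item, mirroring the activeDownloads counter and processQueue() calls.
function makeQueue(maxConcurrent) {
  const pending = []; // Corresponds to downloadQueue
  let active = 0;     // Corresponds to activeDownloads

  function pump() {   // Corresponds to processQueue()
    while (active < maxConcurrent && pending.length > 0) {
      const { task, resolve, reject } = pending.shift();
      active++;
      task()
        .then(resolve, reject)
        .finally(() => {
          active--; // Like the decrement in the onChanged listener
          pump();   // Try the next queued item
        });
    }
  }

  return {
    push(task) {      // task: () => Promise
      return new Promise((resolve, reject) => {
        pending.push({ task, resolve, reject });
        pump();
      });
    },
  };
}
```

Each finished task decrements the active count and pumps the queue again, exactly as the onChanged listener decrements activeDownloads and calls processQueue().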


Section 4: options.js

// Wait until the DOM content is fully loaded
document.addEventListener('DOMContentLoaded', () => {
  // References to the HTML elements in options.html
  const threadsInput = document.getElementById('threads'); // Input for concurrent downloads
  const keepTrackCheckbox = document.getElementById('keepTrack'); // Checkbox for tracking filenames
  const minFileSizeInput = document.getElementById('minFileSize'); // Input for minimum file size
  const saveButton = document.getElementById('save'); // Button to save settings
  const resetButton = document.getElementById('reset'); // Button to reset the download queue
  const clearHistoryButton = document.getElementById('clearHistory'); // Button to clear history

  // Load saved settings from chrome.storage.local
  chrome.storage.local.get(['threads', 'keepTrack', 'minFileSize'], (result) => {
    // Set input values to saved settings or defaults
    threadsInput.value = result.threads || 10; // Default to 10 threads
    keepTrackCheckbox.checked = result.keepTrack !== false; // Default to true
    minFileSizeInput.value = result.minFileSize || 50; // Default to 50 KB
  });

  // Event listener for the Save Settings button
  saveButton.addEventListener('click', () => {
    // Retrieve values from the inputs
    const threads = parseInt(threadsInput.value) || 10;
    const keepTrack = keepTrackCheckbox.checked;
    const minFileSize = parseInt(minFileSizeInput.value) || 50;

    // Save the settings to chrome.storage.local
    chrome.storage.local.set(
      { threads: threads, keepTrack: keepTrack, minFileSize: minFileSize },
      () => {
        alert('Settings saved.');
        // Notify the background script to update settings
        chrome.runtime.sendMessage({ type: 'updateSettings' });
      }
    );
  });

  // Event listener for the Reset Download Queue button
  resetButton.addEventListener('click', () => {
    // Send a message to reset the queue
    chrome.runtime.sendMessage({ type: 'resetQueue' }, (response) => {
      if (response && response.status === 'success') {
        alert('Download queue reset.');
      } else {
        alert('Failed to reset queue.');
      }
    });
  });

  // Event listener for the Clear Downloaded Filenames button
  clearHistoryButton.addEventListener('click', () => {
    // Send a message to clear the history
    chrome.runtime.sendMessage({ type: 'clearHistory' }, (response) => {
      if (response && response.status === 'success') {
        alert('Downloaded filenames history cleared.');
      } else {
        alert('Failed to clear history.');
      }
    });
  });
});

The options.js script handles the user interface of the extension's options page, allowing users to adjust settings and perform actions.
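The defaulting rules that options.js shares with background.js (10 threads, tracking on, 50 KB minimum) can be captured in one pure function. This is a sketch: applyDefaults is our name, and `stored` stands for the object chrome.storage.local.get passes to its callback, where unset keys arrive as undefined:

```javascript
// Sketch of the settings defaults shared by options.js and background.js.
function applyDefaults(stored) {
  return {
    threads: stored.threads || 10,         // Concurrent downloads, default 10
    keepTrack: stored.keepTrack !== false, // Default true when undefined
    minFileSizeBytes: (stored.minFileSize || 50) * 1024, // Stored in KB
  };
}
```

Note the keepTrack rule: `stored.keepTrack !== false` treats undefined as true, so tracking stays enabled until the user explicitly unchecks it.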


Section 5: options.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>nGeneAutomaticDownloader Options</title>
  <style>
    /* Basic styling for the options page */
    body { font-family: Arial, sans-serif; padding: 10px; }
    label { display: block; margin-bottom: 5px; }
    input[type="number"] { width: 50px; }
    button { margin-top: 10px; margin-right: 10px; }
  </style>
</head>
<body>
  <h2>Extension Settings</h2>

  <!-- Setting for the number of concurrent downloads -->
  <label>
    Number of Concurrent Downloads:
    <input type="number" id="threads" min="1" max="10">
  </label>

  <!-- Setting to keep track of downloaded filenames -->
  <label>
    Keep track of downloaded filenames:
    <input type="checkbox" id="keepTrack">
  </label>

  <!-- Setting for the minimum file size threshold -->
  <label>
    Minimum File Size (KB):
    <input type="number" id="minFileSize" min="0" step="1">
  </label>

  <!-- Buttons to save settings, reset download queue, and clear history -->
  <button id="save">Save Settings</button>
  <button id="reset">Reset Download Queue</button>
  <button id="clearHistory">Clear Downloaded Filenames</button>

  <!-- Include the options.js script -->
  <script src="options.js"></script>
</body>
</html>

The options.html file defines the user interface for the extension's options page.

Written on November 30th, 2024


nGeneAutomaticDownloader Extension v1.6 (Written November 30, 2024)

The nGeneAutomaticDownloader is a Firefox extension designed to automatically download all images and videos from webpages, organizing them into a designated 'firefox' folder within the user's default download directory. This document provides a comprehensive overview of the extension's implementation across its constituent files. It elucidates the functionalities and features embedded within each script, offering insights into the mechanisms employed for thread management, download tracking, file type handling, and user interface configuration. This overview serves as a reference to facilitate future reviews and modifications of the extension's codebase.

Code Listings

manifest.json

{
  "manifest_version": 2,
  "name": "nGeneAutomaticDownloader",
  "version": "1.6",
  "description": "Automatically downloads all images and videos from webpages to a 'firefox' folder in your default download directory.",
  "permissions": [
    "downloads",
    "tabs",
    "<all_urls>",
    "storage",
    "webRequest",
    "webRequestBlocking"
  ],
  "background": {
    "scripts": ["background.js"]
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "exclude_matches": ["about:*", "resource://*/*"],
      "js": ["content.js"],
      "run_at": "document_idle"
    }
  ],
  "browser_action": {
    "default_title": "nGeneAutomaticDownloader",
    "default_popup": "options.html",
    "default_icon": {
      "48": "icons/download-icon.png"
    }
  },
  "options_ui": {
    "page": "options.html",
    "open_in_tab": false
  },
  "icons": {
    "48": "icons/download-icon.png"
  }
}

content.js

(function () {
  // Set to keep track of processed media URLs to prevent duplicates
  const processedMediaUrls = new Set();

  // Main function to process media elements starting from a root node
  function processMediaElements(rootNode) {
    const mediaUrls = [];

    if (!rootNode) return;

    // Use a Set to avoid processing the same node multiple times
    const nodesToProcess = new Set();

    // Collect nodes to process
    function collectNodes(node) {
      if (node.nodeType !== Node.ELEMENT_NODE) return;

      nodesToProcess.add(node);

      // Recursively collect child nodes
      node.querySelectorAll('*').forEach((child) => {
        nodesToProcess.add(child);
      });
    }

    collectNodes(rootNode);

    // Process each node
    nodesToProcess.forEach((node) => {
      // Process media elements based on their tag names
      const tagName = node.tagName.toLowerCase();

      if (tagName === 'img') {
        collectImageFromElement(node, mediaUrls);
      } else if (tagName === 'video' || tagName === 'audio') {
        collectMediaFromElement(node, mediaUrls);
      } else if (tagName === 'source') {
        collectSourceFromElement(node, mediaUrls);
      } else if (tagName === 'picture') {
        collectPictureSources(node, mediaUrls);
      } else if (tagName === 'object' || tagName === 'embed') {
        collectObjectEmbedMedia(node, mediaUrls);
      } else if (tagName === 'canvas') {
        collectCanvasImage(node);
      }
    });

    // Collect background images from styles
    collectBackgroundImages(mediaUrls);
    collectInlineStyles(mediaUrls);
    collectPseudoElementImages(mediaUrls);

    // Process the collected media URLs
    processMediaUrls(mediaUrls);
  }

  // Collect image URLs from <img> elements
  function collectImageFromElement(img, mediaUrls) {
    const urls = [];

    // src attribute
    if (img.src) {
      urls.push(img.src);
    }

    // data-src or data-lazy-src attributes for lazy-loaded images
    const dataSrc = img.getAttribute('data-src') || img.getAttribute('data-lazy-src');
    if (dataSrc) {
      urls.push(dataSrc);
    }

    // srcset attribute
    const srcset = img.getAttribute('srcset');
    if (srcset) {
      const srcsetUrls = srcset
        .split(',')
        .map((entry) => entry.trim().split(' ')[0])
        .filter((url) => url);
      urls.push(...srcsetUrls);
    }

    // Add collected URLs to mediaUrls
    mediaUrls.push(...urls);
  }

  // Collect media URLs from <video> and <audio> elements
  function collectMediaFromElement(mediaElement, mediaUrls) {
    const urls = [];

    // src attribute
    if (mediaElement.src) {
      urls.push(mediaElement.src);
    }

    // data-src attribute
    const dataSrc = mediaElement.getAttribute('data-src');
    if (dataSrc) {
      urls.push(dataSrc);
    }

    // Poster attribute (for videos)
    const poster = mediaElement.getAttribute('poster');
    if (poster) {
      urls.push(poster);
    }

    // Collect from child <source> elements
    mediaElement.querySelectorAll('source').forEach((source) => {
      collectSourceFromElement(source, urls);
    });

    // Add collected URLs to mediaUrls
    mediaUrls.push(...urls);
  }

  // Collect URLs from <source> elements
  function collectSourceFromElement(sourceElement, mediaUrls) {
    const urls = [];

    // src attribute
    const src = sourceElement.src || sourceElement.getAttribute('src');
    if (src) {
      urls.push(src);
    }

    // data-src attribute
    const dataSrc = sourceElement.getAttribute('data-src');
    if (dataSrc) {
      urls.push(dataSrc);
    }

    // srcset attribute
    const srcset = sourceElement.getAttribute('srcset');
    if (srcset) {
      const srcsetUrls = srcset
        .split(',')
        .map((entry) => entry.trim().split(' ')[0])
        .filter((url) => url);
      urls.push(...srcsetUrls);
    }

    // Add collected URLs to mediaUrls
    mediaUrls.push(...urls);
  }

  // Collect sources from <picture> elements
  function collectPictureSources(pictureElement, mediaUrls) {
    pictureElement.querySelectorAll('source').forEach((source) => {
      collectSourceFromElement(source, mediaUrls);
    });
  }

  // Collect media from <object> and <embed> elements
  function collectObjectEmbedMedia(element, mediaUrls) {
    const data = element.getAttribute('data');
    if (data) {
      mediaUrls.push(data);
    }
    const src = element.getAttribute('src');
    if (src) {
      mediaUrls.push(src);
    }
  }

  // Collect images from <canvas> elements
  function collectCanvasImage(canvas) {
    try {
      const dataURL = canvas.toDataURL();
      if (dataURL && !processedMediaUrls.has(dataURL)) {
        processedMediaUrls.add(dataURL);
        chrome.runtime.sendMessage(
          { type: 'downloadDataUrl', dataUrl: dataURL },
          function (response) {
            if (chrome.runtime.lastError) {
              console.error(`Error sending canvas image: ${chrome.runtime.lastError.message}`);
            }
          }
        );
      }
    } catch (e) {
      console.error('Failed to extract image from canvas:', e);
    }
  }

  // Collect background images from CSS stylesheets
  function collectBackgroundImages(mediaUrls) {
    for (const sheet of document.styleSheets) {
      let rules;
      try {
        rules = sheet.cssRules;
      } catch (e) {
        continue; // Skip cross-origin stylesheets
      }

      if (!rules) continue;

      for (const rule of rules) {
        if (rule.type === CSSRule.STYLE_RULE) {
          const style = rule.style;
          const bgImage = style.getPropertyValue('background-image') || style.getPropertyValue('background');
          extractUrlsFromStyle(bgImage, mediaUrls);
        } else if (rule.type === CSSRule.MEDIA_RULE) {
          for (const mediaRule of rule.cssRules) {
            if (mediaRule.type === CSSRule.STYLE_RULE) {
              const style = mediaRule.style;
              const bgImage = style.getPropertyValue('background-image') || style.getPropertyValue('background');
              extractUrlsFromStyle(bgImage, mediaUrls);
            }
          }
        }
      }
    }
  }

  // Collect background images from inline styles
  function collectInlineStyles(mediaUrls) {
    document.querySelectorAll('*[style]').forEach((element) => {
      const style = element.getAttribute('style');
      extractUrlsFromStyle(style, mediaUrls);
    });
  }

  // Collect images from pseudo-elements
  function collectPseudoElementImages(mediaUrls) {
    document.querySelectorAll('*').forEach((element) => {
      ['::before', '::after'].forEach((pseudo) => {
        const style = getComputedStyle(element, pseudo);
        const bgImage = style.getPropertyValue('background-image') || style.getPropertyValue('background');
        extractUrlsFromStyle(bgImage, mediaUrls);
      });
    });
  }

  // Extract URLs from CSS style properties
  function extractUrlsFromStyle(styleValue, mediaUrls) {
    if (styleValue && styleValue !== 'none') {
      const urls = styleValue.match(/url\(["']?([^"')]+)["']?\)/g);
      if (urls) {
        urls.forEach((urlString) => {
          const url = urlString.match(/url\(["']?([^"')]+)["']?\)/)[1];
          if (url) {
            const absoluteUrl = new URL(url, location.href).href;
            mediaUrls.push(absoluteUrl);
          }
        });
      }
    }
  }

  // Process collected media URLs
  function processMediaUrls(mediaUrls) {
    const uniqueUrls = Array.from(new Set(mediaUrls));

    uniqueUrls.forEach((url) => {
      const cleanUrl = url.split('#')[0];

      if (processedMediaUrls.has(cleanUrl)) {
        return;
      }

      processedMediaUrls.add(cleanUrl);

      // Handle data URLs
      if (url.startsWith('data:')) {
        chrome.runtime.sendMessage(
          { type: 'downloadDataUrl', dataUrl: url },
          function (response) {
            if (chrome.runtime.lastError) {
              console.error(`Error sending data URL: ${chrome.runtime.lastError.message}`);
            }
          }
        );
        return;
      }

      let filename;
      try {
        const urlObj = new URL(url, location.href);
        filename = urlObj.pathname.split('/').pop() || 'unnamed';
      } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return;
      }

      // Send message to background script to download the file
      chrome.runtime.sendMessage(
        { type: 'download', url: url, filename: filename },
        function (response) {
          if (chrome.runtime.lastError) {
            console.error(`Error sending message for ${url}: ${chrome.runtime.lastError.message}`);
          }
        }
      );
    });
  }

  // Enhance MutationObserver to detect attribute changes and added nodes
  const observer = new MutationObserver((mutations) => {
    mutations.forEach((mutation) => {
      if (mutation.type === 'childList') {
        // Process added nodes
        mutation.addedNodes.forEach((node) => {
          processMediaElements(node);
        });
      } else if (mutation.type === 'attributes') {
        processMediaElements(mutation.target);
      }
    });
  });

  // Start observing the document for changes
  observer.observe(document, {
    childList: true,
    subtree: true,
    attributes: true,
    attributeFilter: [
      'src',
      'srcset',
      'data-src',
      'data-lazy-src',
      'poster',
      'style',
      'data',
      'href',
    ],
  });

  // Initial processing
  processMediaElements(document);

  // Re-process periodically to catch any missed elements
  setInterval(() => {
    processMediaElements(document);
  }, 5000); // Adjust the interval as needed
})();
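The srcset handling repeated in collectImageFromElement and collectSourceFromElement above can be isolated as one pure function. This is a sketch: the name parseSrcset is ours, and like the original code it ignores the rare case of commas inside data: URLs:

```javascript
// Pure-function version of the srcset parsing used in content.js: split on
// commas, keep each candidate URL, and drop its width/density descriptor
// (e.g. "480w" or "2x").
function parseSrcset(srcset) {
  return srcset
    .split(',')
    .map((entry) => entry.trim().split(' ')[0])
    .filter((url) => url);
}
```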

background.js

// Variables to manage downloads and settings
let downloadQueue = [];
let activeDownloads = 0;
let maxConcurrentDownloads = 10;
let keepTrack = true;
let minFileSize = 50 * 1024;
let downloadedFilenames = new Set();
let downloadedUrls = new Set();

// Load initial settings from storage
chrome.storage.local.get(
  [
    'threads',
    'keepTrack',
    'minFileSize',
    'downloadedFilenames',
    'downloadedUrls',
  ],
  (result) => {
    maxConcurrentDownloads = result.threads || 10;
    keepTrack = result.keepTrack !== false;
    minFileSize = (result.minFileSize || 50) * 1024;
    if (result.downloadedFilenames && Array.isArray(result.downloadedFilenames)) {
      downloadedFilenames = new Set(result.downloadedFilenames);
    }
    if (result.downloadedUrls && Array.isArray(result.downloadedUrls)) {
      downloadedUrls = new Set(result.downloadedUrls);
    }
  }
);

// Listener for messages from content scripts
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === 'updateSettings') {
    // Update settings
    chrome.storage.local.get(['threads', 'keepTrack', 'minFileSize'], (result) => {
      maxConcurrentDownloads = result.threads || 10;
      keepTrack = result.keepTrack !== false;
      minFileSize = (result.minFileSize || 50) * 1024;
      console.log(
        `Updated settings: maxConcurrentDownloads = ${maxConcurrentDownloads}, keepTrack = ${keepTrack}, minFileSize = ${minFileSize} bytes`
      );
    });
  } else if (message.type === 'resetQueue') {
    downloadQueue = [];
    sendResponse({ status: 'success' });
  } else if (message.type === 'clearHistory') {
    downloadedFilenames.clear();
    downloadedUrls.clear();
    chrome.storage.local.set(
      { downloadedFilenames: [], downloadedUrls: [] },
      () => {
        sendResponse({ status: 'success' });
      }
    );
    return true;
  } else if (message.type === 'download') {
    const url = message.url;
    const filename = message.filename;

    if (keepTrack && downloadedUrls.has(url)) {
      console.log(`Skipping download for URL ${url} as it has already been downloaded.`);
      sendResponse({ status: 'skipped' });
      return;
    }

    downloadQueue.push({ url: url, filename: filename });
    processQueue();
    sendResponse({ status: 'queued' });
  } else if (message.type === 'downloadDataUrl') {
    const dataUrl = message.dataUrl;
    const filename = `image_${Date.now()}.png`;

    fetch(dataUrl)
      .then((res) => res.blob())
      .then((blob) => {
        const url = URL.createObjectURL(blob);
        downloadQueue.push({ url: url, filename: filename, revokeUrl: true });
        processQueue();
        sendResponse({ status: 'queued' });
      })
      .catch((error) => {
        console.error(`Failed to download data URL: ${error}`);
        sendResponse({ status: 'error', error: error.toString() });
      });
    return true;
  }
});

// Function to process the download queue
function processQueue() {
  while (activeDownloads < maxConcurrentDownloads && downloadQueue.length > 0) {
    const item = downloadQueue.shift();
    startDownload(item);
  }
}

// Function to start a download
function startDownload(item) {
  activeDownloads++;

  const filename = item.filename || item.url.split('/').pop();
  const fullFilename = `firefox/${sanitizeFilename(filename)}`;

  chrome.downloads.download(
    {
      url: item.url,
      filename: fullFilename,
      saveAs: false,
      conflictAction: 'overwrite',
    },
    (downloadId) => {
      if (chrome.runtime.lastError) {
        console.error(`Download failed for ${item.url}: ${chrome.runtime.lastError.message}`);
        activeDownloads--;
        processQueue();
      } else {
        console.log(`Download started: ID = ${downloadId}, filename = ${fullFilename}`);

        // Listener for changes in the download state
        function onChanged(delta) {
          if (
            delta.id === downloadId &&
            delta.state &&
            delta.state.current === 'complete'
          ) {
            chrome.downloads.search({ id: downloadId }, function (items) {
              if (items && items.length > 0) {
                const downloadItem = items[0];
                const fileSize = downloadItem.fileSize || downloadItem.totalBytes;
                if (fileSize < minFileSize) {
                  chrome.downloads.removeFile(downloadId, function () {
                    if (chrome.runtime.lastError) {
                      console.error(`Failed to remove file: ${chrome.runtime.lastError.message}`);
                    } else {
                      console.log(
                        `Removed file ${fullFilename} (size ${fileSize} bytes) because it is smaller than the minimum size (${minFileSize} bytes).`
                      );
                    }
                    chrome.downloads.erase({ id: downloadId });
                  });
                } else {
                  if (keepTrack) {
                    downloadedFilenames.add(filename);
                    downloadedUrls.add(item.url);
                    chrome.storage.local.set({
                      downloadedFilenames: Array.from(downloadedFilenames),
                      downloadedUrls: Array.from(downloadedUrls),
                    });
                  }
                }
              }
            });

            chrome.downloads.onChanged.removeListener(onChanged);
            activeDownloads--;
            processQueue();

            if (item.revokeUrl) {
              URL.revokeObjectURL(item.url);
            }
          } else if (
            delta.id === downloadId &&
            delta.state &&
            delta.state.current === 'interrupted'
          ) {
            console.error(`Download interrupted for ${item.url}`);
            chrome.downloads.onChanged.removeListener(onChanged);
            activeDownloads--;
            processQueue();

            if (item.revokeUrl) {
              URL.revokeObjectURL(item.url);
            }
          }
        }

        chrome.downloads.onChanged.addListener(onChanged);
      }
    }
  );
}

// Function to sanitize filenames
function sanitizeFilename(filename) {
  return filename.replace(/[\\/:*?"<>|]/g, '_');
}

// Modify webRequest listener to capture media requests
chrome.webRequest.onCompleted.addListener(
  (details) => {
    const url = details.url;

    // Check if the request is for an image or video
    if (details.type === 'image' || details.type === 'media') {
      if (keepTrack && downloadedUrls.has(url)) {
        console.log(`Skipping download for URL ${url} as it has already been downloaded.`);
        return;
      }

      // Extract filename from URL
      let filename;
      try {
        const urlObj = new URL(url);
        filename = urlObj.pathname.split('/').pop() || 'unnamed';
      } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return;
      }

      filename = sanitizeFilename(filename);

      // Add the download request to the queue
      downloadQueue.push({ url: url, filename: filename });
      processQueue();
    }
  },
  { urls: ['<all_urls>'] },
  []
);

options.js

document.addEventListener('DOMContentLoaded', () => {
  const threadsInput = document.getElementById('threads');
  const keepTrackCheckbox = document.getElementById('keepTrack');
  const minFileSizeInput = document.getElementById('minFileSize');
  const saveButton = document.getElementById('save');
  const resetButton = document.getElementById('reset');
  const clearHistoryButton = document.getElementById('clearHistory');

  // Load saved settings
  chrome.storage.local.get(['threads', 'keepTrack', 'minFileSize'], (result) => {
    threadsInput.value = result.threads || 10;
    keepTrackCheckbox.checked = result.keepTrack !== false;
    minFileSizeInput.value = result.minFileSize || 50;
  });

  // Save settings
  saveButton.addEventListener('click', () => {
    const threads = parseInt(threadsInput.value, 10) || 10;
    const keepTrack = keepTrackCheckbox.checked;
    const minFileSize = parseInt(minFileSizeInput.value, 10) || 50;

    chrome.storage.local.set(
      { threads: threads, keepTrack: keepTrack, minFileSize: minFileSize },
      () => {
        alert('Settings saved.');
        chrome.runtime.sendMessage({ type: 'updateSettings' });
      }
    );
  });

  // Reset download queue
  resetButton.addEventListener('click', () => {
    chrome.runtime.sendMessage({ type: 'resetQueue' }, (response) => {
      if (response && response.status === 'success') {
        alert('Download queue reset.');
      } else {
        alert('Failed to reset queue.');
      }
    });
  });

  // Clear download history
  clearHistoryButton.addEventListener('click', () => {
    chrome.runtime.sendMessage({ type: 'clearHistory' }, (response) => {
      if (response && response.status === 'success') {
        alert('Downloaded filenames history cleared.');
      } else {
        alert('Failed to clear history.');
      }
    });
  });
});

options.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>nGeneAutomaticDownloader Options</title>
  <style>
    body { font-family: Arial, sans-serif; padding: 10px; }
    label { display: block; margin-bottom: 5px; }
    input[type="number"] { width: 50px; }
    button { margin-top: 10px; margin-right: 10px; }
  </style>
</head>
<body>
  <h2>Extension Settings</h2>

  <label>
    Number of Concurrent Downloads:
    <input type="number" id="threads" min="1" max="10">
  </label>

  <label>
    Keep track of downloaded filenames:
    <input type="checkbox" id="keepTrack">
  </label>

  <label>
    Minimum File Size (KB):
    <input type="number" id="minFileSize" min="0" step="1">
  </label>

  <button id="save">Save Settings</button>
  <button id="reset">Reset Download Queue</button>
  <button id="clearHistory">Clear Downloaded Filenames</button>

  <script src="options.js"></script>
</body>
</html>

Feature Implementation

1. Manifest Configuration (manifest.json)

The manifest.json file serves as the blueprint for the Firefox extension, delineating its metadata, permissions, and the scripts it employs.
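The manifest itself is not reproduced in this section. As a point of reference, a minimal sketch consistent with the APIs the listed scripts use (`chrome.downloads`, `chrome.webRequest`, `chrome.storage`, a content script, and an options page) might look like the following; the exact keys and values in the shipped extension may differ, and Manifest V2 is assumed here since Firefox continues to support it:

```json
{
  "manifest_version": 2,
  "name": "nGeneAutomaticDownloader",
  "version": "1.6",
  "permissions": [
    "downloads",
    "webRequest",
    "storage",
    "<all_urls>"
  ],
  "background": {
    "scripts": ["background.js"]
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["content.js"]
    }
  ],
  "options_ui": {
    "page": "options.html"
  }
}
```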

2. Content Script Functionality (content.js)

The content.js script is responsible for identifying and extracting media elements (images and videos) from the webpages the user visits. Its implementation encompasses several key features:
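The content script itself is not listed in this section. The following is a hypothetical sketch of the extraction logic it describes: the DOM walk is factored into a pure helper (`collectMediaUrls`, an illustrative name) so the selection logic can run outside the browser, and the `sendMessage` payload shape is an assumption, not the extension's actual message format.

```javascript
// Collect downloadable media URLs from a list of element-like objects.
// Pure function: works on plain objects as well as real DOM elements.
function collectMediaUrls(elements) {
  return elements
    .filter((el) => el.tagName === 'IMG' || el.tagName === 'VIDEO')
    .map((el) => el.currentSrc || el.src)
    .filter((src) => typeof src === 'string' && src.startsWith('http'));
}

// In the browser, scan the page and hand the URLs to the background script.
// (Guarded so the helper above can also be exercised outside an extension.)
if (typeof document !== 'undefined' && typeof chrome !== 'undefined') {
  const urls = collectMediaUrls(Array.from(document.querySelectorAll('img, video')));
  chrome.runtime.sendMessage({ type: 'mediaFound', urls: urls });
}
```

Filtering out non-`http(s)` sources (e.g. `data:` URIs) mirrors the background script's reliance on URL-based filenames and deduplication.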

3. Background Script Operations (background.js)

The background.js script orchestrates the downloading process, managing download queues, threads, and tracking mechanisms.

4. User Interface and Settings Management (options.js and options.html)

The options.html and options.js files collectively provide a user interface for configuring the extension's settings.

5. Thread Management and Concurrency Control

Although extension JavaScript is single-threaded, "threads" here means concurrent downloads: the background script caps the number of simultaneous `chrome.downloads` calls so that a large queue does not overwhelm network or system resources.

  1. Initialization: Upon startup, the background script retrieves user-defined settings, including the number of concurrent threads (maxConcurrentDownloads), from storage.
  2. Processing Logic: The processQueue function oversees the download queue, initiating downloads as long as the number of active downloads is below the specified limit. This function is invoked whenever a new download is added to the queue or when an active download completes.
  3. Download Lifecycle: Each download task increments the activeDownloads count upon initiation. Listeners monitor the download's progress, decrementing the count and triggering the processing of subsequent queued downloads upon completion or interruption.
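The three steps above amount to a bounded-concurrency work queue. The sketch below restates that pattern with plain callbacks (no browser APIs) so it can run anywhere; `makeScheduler` and its member names are illustrative, not names from the extension.

```javascript
// Bounded-concurrency queue: at most maxConcurrent tasks run at once.
function makeScheduler(maxConcurrent) {
  const queue = [];
  let active = 0;
  const started = []; // record of task start order, for inspection

  function processQueue() {
    while (active < maxConcurrent && queue.length > 0) {
      const task = queue.shift();
      active++;
      started.push(task.name);
      // Each task receives a "done" callback, mirroring how the extension's
      // onChanged listener decrements activeDownloads and re-runs the queue.
      task.run(() => {
        active--;
        processQueue();
      });
    }
  }

  return {
    enqueue(name, run) {
      queue.push({ name: name, run: run });
      processQueue();
    },
    started: started,
    activeCount: () => active,
  };
}

// Usage: four tasks, at most two in flight at once.
const scheduler = makeScheduler(2);
const pending = [];
for (const name of ['a', 'b', 'c', 'd']) {
  scheduler.enqueue(name, (done) => pending.push(done)); // "runs" until done() fires
}
console.log(scheduler.started); // → [ 'a', 'b' ]
pending[0](); // task 'a' finishes; 'c' starts
console.log(scheduler.started); // → [ 'a', 'b', 'c' ]
```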

6. Download Tracking Mechanism

To avoid redundant downloads and optimize performance, the extension tracks completed downloads: when the keepTrack option is enabled, each completed file's URL and filename are recorded in in-memory Sets and persisted via chrome.storage.local, and any incoming request whose URL is already recorded is skipped.
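The tracking logic can be sketched as a standalone object so it runs outside the browser. `createTracker` and its method names are illustrative; the snapshot shape matches what background.js stores via `chrome.storage.local`.

```javascript
// Deduplication tracker: remembers downloaded URLs and filenames, and can be
// restored from / serialized to the plain-object shape used for persistence.
function createTracker(saved = {}) {
  const downloadedFilenames = new Set(saved.downloadedFilenames || []);
  const downloadedUrls = new Set(saved.downloadedUrls || []);
  return {
    seen: (url) => downloadedUrls.has(url),
    record(url, filename) {
      downloadedUrls.add(url);
      downloadedFilenames.add(filename);
    },
    // Plain-object snapshot, suitable for chrome.storage.local.set(...)
    snapshot: () => ({
      downloadedFilenames: Array.from(downloadedFilenames),
      downloadedUrls: Array.from(downloadedUrls),
    }),
  };
}

const tracker = createTracker();
tracker.record('https://example.com/a.jpg', 'a.jpg');
console.log(tracker.seen('https://example.com/a.jpg')); // → true

// Restoring from a snapshot preserves the history across restarts:
const restored = createTracker(tracker.snapshot());
console.log(restored.seen('https://example.com/a.jpg')); // → true
```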

7. File Type Handling

The extension downloads only media resources: the webRequest listener queues a request solely when its resource type is 'image' or 'media', which covers images and videos while ignoring scripts, stylesheets, and other page assets.

8. User Interface Design

The extension's user interface is a single options page exposing three settings (the concurrent-download limit, filename tracking, and the minimum file size) alongside buttons to save settings, reset the download queue, and clear the download history.

Written on November 30th, 2024

