Ethical Considerations of Web Crawling, in Project nGene.org®

Project nGene.org is an advanced academic software designed to facilitate programming and research in the field of hemodynamics, integrating computational modeling, simulation, medical statistics, and machine learning. As part of its multifaceted approach, Project nGene.org employs web crawling (web scraping) to aggregate and analyze vast amounts of biomedical data from various online sources. This section delineates the ethical framework guiding the web crawling activities within Project nGene.org, ensuring that data collection practices align with legal standards and the project's academic integrity.

(A) Purpose of Web Crawling in Project nGene.org

Legitimate and Essential Uses:

Data Aggregation for Research: Project nGene.org utilizes web crawling to collect extensive datasets from publicly available biomedical research articles, clinical trial repositories, and medical databases. This aggregated data is pivotal for developing accurate computational models and simulations in hemodynamics.
Semantic Web Integration: By scraping meta-analyses, survival curves, and other statistical data from medical literature, Project nGene.org enriches its semantic web database. This integration is crucial for enhancing machine learning algorithms and facilitating comprehensive data analysis.
Real-Time Data Updates: Web crawling enables Project nGene.org to maintain up-to-date information on the latest research findings, ensuring that simulations and models reflect current scientific knowledge and clinical practices.

Avoiding Potentially Unethical Uses:

Data Sensitivity and Confidentiality: Project nGene.org is committed to avoiding the collection of sensitive personal information. The focus remains strictly on publicly available data pertinent to hemodynamic research, thereby mitigating risks associated with data theft or privacy breaches.

(B) Respecting Website Policies

Adherence to Ethical Guidelines and Legal Frameworks:

Consent and Authorization: Before initiating any web crawling activity, Project nGene.org ensures that it has the necessary permissions to access and use the targeted data. This involves reviewing the website’s Terms of Service (ToS) and seeking explicit consent when required. Respecting the boundaries set by website administrators prevents unauthorized data access and maintains the integrity of the research.
Non-Intrusive Crawling Practices: Project nGene.org employs crawling techniques that are non-intrusive and do not disrupt the normal functioning of the target websites. By minimizing the frequency and intensity of requests, the software ensures that its activities do not interfere with the user experience or the operational stability of the websites being crawled.
Transparency with Data Sources: Project nGene.org maintains transparency regarding the sources of its data. By clearly documenting and disclosing the origins of the scraped data, the project fosters trust with both data providers and the broader academic community. This transparency also facilitates accountability and allows for verification of data integrity.
Respect for Content Ownership: Recognizing that digital content is often the intellectual property of its creators, Project nGene.org takes care to use the scraped data in ways that honor the rights of content owners. This includes adhering to licensing agreements and avoiding the misuse of proprietary information.

(C) Data Privacy and Consent

Integrating AI Ethics and Data Protection Principles:

a. Ethical Foundations Inspired by AI Ethics:

Post-Humanism Considerations: In the realm of AI ethics, Project nGene.org acknowledges that while AI systems may develop distinct forms of intelligence and decision-making capabilities, the project remains committed to ensuring that data collection practices respect human moral frameworks. The software does not aim to mimic human intelligence but rather to enhance research capabilities within ethical boundaries.
Transparency and Explainability: Project nGene.org prioritizes transparency in its data collection processes. By maintaining clear documentation of how data is gathered, processed, and utilized, the project ensures that its methodologies are understandable and accountable. This approach aligns with the ethical imperative for explainable AI, facilitating trust and reliability in the software's operations.
Responsibility and Accountability: The project recognizes the importance of assigning moral responsibility in its AI-driven processes. Project nGene.org ensures that data collection and usage adhere to ethical standards, preventing misuse and maintaining accountability for the data's application in research and simulations.

b. Compliance with Data Protection Laws:

General Data Protection Regulation (GDPR): Project nGene.org adheres to GDPR principles by ensuring that personal data is collected lawfully, with explicit consent where required. The software implements data minimization strategies, collecting only the data necessary for its research objectives and ensuring that this data is accurate and up-to-date.
California Consumer Privacy Act (CCPA): In compliance with CCPA, Project nGene.org provides mechanisms for individuals to access, correct, or delete their personal data if it inadvertently collects such information. This commitment ensures that the software respects individual privacy rights and maintains data integrity.

c. Advanced Privacy-Preserving Techniques:

Federated Learning: Project nGene.org employs federated learning to process data from multiple sources without centralizing sensitive information. By training machine learning models locally on decentralized data, the project enhances data privacy and reduces the risk of unauthorized data exposure.
Differential Privacy: To further safeguard individual privacy, Project nGene.org integrates differential privacy techniques. By introducing controlled randomness into data processing, the software ensures that the inclusion or exclusion of any single data point does not significantly affect the overall analysis, thereby protecting individual identities and sensitive information.

d. Ethical Data Handling Practices:

Collection Limitation: Project nGene.org restricts data collection to what is necessary for its research purposes, ensuring that no excessive or irrelevant data is harvested.
Data Quality and Purpose Specification: The project maintains high standards for data quality, ensuring that collected data is accurate, relevant, and used strictly for the specified research objectives.
Security Safeguards: Robust security measures are in place to protect collected data from unauthorized access, loss, or misuse. This includes encryption, secure storage solutions, and regular security audits.
Openness and Individual Participation: Project nGene.org fosters transparency by openly communicating its data practices and providing avenues for individuals to exercise their data rights, such as accessing or rectifying their personal information.

(D) Impact on Website Performance

Ensuring Responsible Resource Utilization and Server Security:

a. Responsible Crawling to Minimize Server Load:

Rate Limiting and Throttling: Project nGene.org implements strict rate limiting protocols to control the frequency of requests sent to target websites. By pacing requests similarly to human browsing behavior, the software minimizes the risk of overloading servers, thereby maintaining the operational stability of the websites being crawled.
Optimized Data Retrieval: The software is designed to fetch only essential data, reducing unnecessary bandwidth consumption. This optimization ensures efficient use of network resources, preventing undue strain on both the crawler and the target websites.

b. Robust Server Security Measures:

Transition to macOS Server: Initially operating on a Linux Debian server, Project nGene.org has recognized the need for enhanced security with the introduction of federated learning and web-based deep learning server operations. Transitioning to a macOS Server, equipped with a Silicon chip, provides better control over CPU memory and network resources. This move ensures efficient management and prevents excessive usage by any single user, aligning with the project's goal of supporting the rapid growth of clinical applied hemodynamic research and development.
Comprehensive Security Protocols: Through extensive learning and testing of security measures, Project nGene.org ensures that its server infrastructure is fortified against potential threats. This includes implementing secure authentication methods, regular security updates, and monitoring systems to detect and respond to unauthorized access attempts.
Documentation and Knowledge Sharing: Recognizing the scarcity of macOS security resources, Project nGene.org plans to document and share the security strategies employed in its server setup. This initiative not only enhances the project's own security posture but also contributes valuable insights to the broader research community facing similar challenges.

c. Sustainable and Ethical Resource Management:

Preventing Excessive Usage: By managing CPU memory and network resources effectively, Project nGene.org ensures that its operations do not monopolize server capabilities. This responsible resource management fosters sustainability and ethical usage, preventing any negative impact on the server's performance and availability for other users or processes.
Contribution to Research Efficiency: The optimized server performance directly supports the project's mission to advance hemodynamic research. Efficient server operations enable faster data processing and simulation runs, facilitating timely and impactful research outcomes without compromising ethical standards of resource usage.

(E) Intellectual Property and Copyright

Navigating Intellectual Property in the Software Era:

a. Understanding the Dual Nature of Software:

Creative and Functional Aspects: Project nGene.org recognizes that software embodies both creative expression and functional utility. While the underlying algorithms and functional components serve practical purposes in hemodynamic research, the software's design, user interface, and structural elements reflect creative problem-solving and innovative thinking. This duality necessitates a nuanced approach to intellectual property management.
Challenges in Legal Frameworks: Traditional intellectual property laws, such as copyright for artistic works and patents for inventions, often struggle to adequately address the unique characteristics of software. Project nGene.org navigates this complexity by ensuring that both the creative and functional aspects of the software are protected and utilized responsibly.

b. Copyright Limitations and Fair Use Considerations:

Expiration and Duration: Project nGene.org adheres to copyright laws that grant creators exclusive rights for a specific duration—typically the creator's life plus 70 years or 95 years from publication for corporate works. Understanding these limitations ensures that the software respects the temporal boundaries of content protection.
Fair Use and Fair Dealing: Project nGene.org leverages fair use principles to utilize copyrighted materials for educational and transformative purposes, such as developing simulations and conducting research. By adhering to the four key factors of fair use—purpose, nature, amount, and effect on the market—the project ensures that its use of copyrighted content remains within legal and ethical boundaries.
Reverse Engineering and Interoperability: While Project nGene.org may engage in reverse engineering to develop compatible tools and enhance interoperability within its software ecosystem, it ensures that such activities comply with fair use doctrines and do not infringe upon the rights of original content creators. This careful balance fosters innovation while respecting intellectual property rights.

c. Paracopyright and Digital Rights Management (DRM):

Licensing Agreements: Project nGene.org operates under clear licensing agreements that outline permissible uses of its software. These agreements prevent unauthorized modifications, reverse engineering, or redistribution, safeguarding the project's intellectual property and ensuring that the software is used as intended.
DRM Systems: To protect against unauthorized access and distribution, Project nGene.org implements DRM technologies. These measures prevent illicit copying, sharing, or tampering with the software, thereby upholding the project's intellectual property rights and maintaining the software's integrity.
Compliance with Anti-Circumvention Laws: Project nGene.org ensures that its DRM implementations comply with relevant laws, such as the Digital Millennium Copyright Act (DMCA) in the U.S. By avoiding techniques that bypass security measures without authorization, the project maintains legal compliance and ethical standards in its intellectual property management.

d. Idea/Expression Distinction and Application in Software:

Protecting Expressions, Not Ideas: Project nGene.org understands that while ideas, algorithms, and functional concepts are not protected under copyright law, the specific implementation and expression of these ideas are. This distinction allows the project to innovate and build upon existing knowledge while safeguarding its unique contributions.
Ethical Use of RGB Values and Design Elements: In developing simulations and visual representations within the software, Project nGene.org ensures that it does not infringe upon copyrighted designs or trademarks. For instance, while RGB values used in color harmonies are considered factual data and are not protected, the overall design and aesthetic elements are treated with respect for intellectual property rights.

e. Promoting Open Innovation and Collaboration:

Balancing Protection with Accessibility: Project nGene.org strives to balance the protection of its intellectual property with the promotion of open innovation. By providing accessible tools and resources within ethical boundaries, the project fosters collaboration and knowledge sharing in the academic community without compromising its own IP rights.
Encouraging Transformative Uses: The project supports transformative uses of its software, such as adapting and enhancing its functionalities for diverse research applications. This approach aligns with fair use principles, allowing for the evolution and improvement of the software in ways that benefit the broader research landscape.

(F) Transparency and Accountability

Ensuring Openness and Responsible Stewardship:

Clear Identification and Communication: Project nGene.org employs transparent identification practices in its web crawling activities. By using descriptive User-Agent strings that include contact information, the project ensures that its presence is identifiable and accountable. This transparency allows website administrators and stakeholders to recognize the source of data requests and facilitates open communication channels.
Comprehensive Documentation: The project maintains detailed documentation of its data collection methodologies, sources, and usage protocols. This documentation not only supports internal accountability but also provides external stakeholders with insights into how data is gathered and utilized, fostering trust and credibility within the academic community.
Auditable Processes: Project nGene.org implements auditable processes for data handling and web crawling operations. Regular audits and reviews ensure that ethical standards are consistently met and that any deviations are promptly addressed. This commitment to accountability upholds the project's integrity and aligns with best practices in responsible research.
Feedback and Collaboration: Project nGene.org encourages feedback from data providers, collaborators, and the broader research community. By actively seeking input and engaging in collaborative discussions, the project remains responsive to ethical concerns and continuously improves its practices to align with evolving standards and expectations.
Ethical Governance: The project establishes an ethical governance framework that oversees its data collection and usage practices. This framework includes guidelines, policies, and oversight mechanisms that ensure all activities are conducted responsibly and ethically, reinforcing the project's commitment to integrity and accountability.

(G) Compliance with Legal Frameworks

Adhering to Comprehensive Legal Standards and Regulations:

Data Protection Laws:

General Data Protection Regulation (GDPR): Project nGene.org ensures full compliance with GDPR by implementing data protection principles such as data minimization, purpose limitation, and ensuring lawful processing of personal data. The project provides mechanisms for data subjects to exercise their rights, including access, rectification, and deletion of their personal information.
California Consumer Privacy Act (CCPA): In alignment with CCPA, Project nGene.org offers users the ability to opt-out of data collection practices, access their personal data, and request deletion of their information. This compliance ensures that the project respects individual privacy rights and maintains transparency in data handling.
OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data: Project nGene.org adheres to the OECD’s core principles, ensuring that personal data is collected fairly, maintained accurately, used solely for specified purposes, and protected with robust security measures. The project upholds individual participation rights, allowing data subjects to access and control their information.

Anti-Circumvention Laws:

Digital Millennium Copyright Act (DMCA): Project nGene.org complies with DMCA provisions by avoiding the circumvention of digital rights management (DRM) systems and other security measures implemented by websites. The project ensures that its web crawling activities do not infringe upon protected content or violate anti-circumvention laws.

International Compliance:

Cross-Border Data Transfers: Project nGene.org implements safeguards for cross-border data transfers, ensuring that data processed in compliance with GDPR, CCPA, and other regional regulations. This includes using encryption, secure data storage solutions, and adhering to transfer mechanisms such as Standard Contractual Clauses (SCCs) when handling data across different jurisdictions.
Local Laws and Regulations: The project remains informed about and compliant with local data protection and privacy laws in all regions where it operates. This proactive approach ensures that Project nGene.org navigates the complex legal landscape effectively, minimizing the risk of non-compliance and legal repercussions.

Intellectual Property Laws:

Copyright and Licensing Compliance: Project nGene.org respects copyright laws by ensuring that all scraped content is used within the boundaries of fair use or under appropriate licenses. The project avoids unauthorized reproduction or distribution of copyrighted material, maintaining legal and ethical standards in its operations.
Patent Laws: When utilizing or integrating patented technologies, Project nGene.org ensures that it holds the necessary licenses or permissions. This compliance prevents infringement of patent rights and supports the lawful use of proprietary innovations.

Ethical Standards and Best Practices:

Continuous Monitoring and Adaptation: Project nGene.org actively monitors changes in legal frameworks and adapts its practices accordingly. This vigilance ensures ongoing compliance and responsiveness to new regulations, fostering a culture of ethical and lawful data handling.
Legal Consultation and Expertise: The project engages with legal experts to navigate complex legal requirements and ensure that all web crawling activities are conducted within the bounds of the law. This collaboration provides assurance that Project nGene.org operates with a thorough understanding of its legal obligations.

Best Practices Implemented by Project nGene.org for Ethical Web Crawling

Respect Website Policies:
- Action: Before initiating any crawling activity, Project nGene.org reviews and complies with the target website’s Terms of Service. This ensures that web crawling activities respect the access permissions and restrictions set by website administrators, maintaining a respectful and non-intrusive presence.
Limit Request Rates:
- Action: The software incorporates rate limiting mechanisms, spacing out requests to mimic human browsing patterns and prevent server strain.
Identify Your Crawler:
- Action: Utilizing a clear and descriptive User-Agent string that includes contact information, Project nGene.org ensures transparency in its web crawling operations.
Avoid Collecting Sensitive Data:
- Action: The project focuses solely on publicly available and non-sensitive data, deliberately avoiding the collection of personal, confidential, or restricted information.
Respect Data Privacy:
- Action: Project nGene.org adheres to data protection regulations, implementing robust data security measures to safeguard collected information.
Provide Opt-Out Mechanisms:
- Action: While not always directly controllable, Project nGene.org responds promptly to requests from website owners to cease crawling, respecting their preferences and maintaining ethical standards.
Use Data Responsibly:
- Action: The collected data is utilized solely for academic research and development within the field of hemodynamics, avoiding misuse or unauthorized distribution.
Stay Informed:
- Action: The project team remains updated on evolving laws, regulations, and best practices related to web crawling and data collection, ensuring ongoing compliance and ethical conduct.

Comparative Analysis of Three Web Crawler Prototypes

Project nGene.org has developed three distinct web crawler prototypes, each utilizing different programming languages and methodologies. These prototypes serve as foundational tools for automated data collection and analysis, essential for advancing research objectives. This analysis delineates the programming characteristics, advantages, and limitations of each version, providing a comprehensive understanding of their operational dynamics. It is important to note that these implementations are in the prototype stage, primarily designed for testing and evaluation purposes.

(A) JavaScript-Based Web Crawler

Programming Language and Environment

Language: JavaScript
Environment: Client-side execution within a web browser
Technologies Used: HTML, CSS, Fetch API, DOMParser, Asynchronous JavaScript (async/await), Event Listeners

Architecture and Design

User Interface: Incorporates HTML input fields for URL, depth, and tag specifications, alongside buttons to initiate and terminate the crawling process.
Crawling Logic: Utilizes the Fetch API to retrieve web pages and DOMParser to parse HTML content. The crawler targets specific HTML tags (e.g., <a>) to extract links.
Concurrency: Implements asynchronous operations using async/await to handle multiple fetch requests without blocking the main thread.
State Management: Employs a stopCrawling flag to control the initiation and termination of the crawling process.

Advantages

Ease of Deployment: Being client-side, it does not require server-side infrastructure, facilitating quick testing and deployment within a browser environment.
Real-Time Interaction: Provides immediate feedback and control through the browser interface, enhancing user experience.
Simplicity: The use of native browser APIs like Fetch and DOMParser simplifies implementation, reducing dependencies on external libraries.

Limitations

CORS Policy Restrictions: Browsers enforce Cross-Origin Resource Sharing (CORS) policies, limiting access to resources from different domains unless explicitly permitted by the target server.
Same-Origin Policy Constraints: The same-origin policy restricts interactions with resources outside the originating domain, confining operations primarily to the same domain or to domains that allow cross-origin requests.
Performance and Scalability: Client-side execution within a browser is not optimized for intensive crawling tasks. High recursion depths and extensive data processing can lead to significant slowdowns or browser crashes.
Handling Dynamic Content: The crawler parses only static HTML content and does not execute or render JavaScript, resulting in incomplete data collection from modern websites that rely heavily on client-side rendering.

Potential Mitigation Strategies

Using CORS Proxies: Routing requests through CORS proxies can bypass browser-enforced restrictions, enabling access to cross-origin resources.
Server-Side Crawling Integration: Shifting crawling operations to a server-side environment can circumvent CORS policies and enhance scalability and performance.
Incorporating Headless Browsers: Utilizing headless browsers can facilitate the execution of JavaScript, allowing access to dynamically loaded content.

(B) Python-Based Web Crawler Utilizing Requests and BeautifulSoup

Programming Language and Environment

Language: Python
Environment: Server-side execution
Technologies Used: requests, BeautifulSoup, concurrent.futures.ThreadPoolExecutor, urllib.parse

Architecture and Design

Download Directory Configuration: Specifies a download directory, creating it if it does not exist, to systematically store retrieved files.
Concurrency: Employs ThreadPoolExecutor to enable multithreaded operations, enhancing performance by handling multiple tasks concurrently.
HTML Parsing: Utilizes BeautifulSoup to parse HTML content and extract relevant links and resources.
Error Handling: Incorporates try-except blocks to manage exceptions during HTTP requests and file operations, ensuring robustness against network issues and invalid URLs.
Dynamic URL Construction: Uses urllib.parse.urljoin to construct absolute URLs from relative links, ensuring accurate resource retrieval.

Advantages

Scalability: Server-side execution allows for handling large-scale data collection tasks without being constrained by browser performance limitations.
Flexibility: The use of Python libraries like requests and BeautifulSoup provides powerful tools for HTTP requests and HTML parsing, facilitating complex data extraction processes.
Concurrency Management: Multithreading capabilities enable efficient handling of multiple tasks simultaneously, reducing overall execution time.

Limitations

Network Dependency and Latency: The crawler's performance is heavily reliant on network stability and the responsiveness of target servers, making it susceptible to network disruptions.
Rate Limiting and Server Load: Intensive crawling can impose significant load on target servers, potentially leading to throttling or blocking, necessitating the implementation of polite crawling practices.
Handling JavaScript-Rendered Content: Similar to the JavaScript-based crawler, this implementation may struggle with websites that rely heavily on JavaScript for content rendering, resulting in incomplete data collection.
File Naming and Management: Ensuring unique and meaningful filenames to prevent overwriting and maintain organization can be challenging, especially when dealing with numerous files from diverse sources.

Potential Mitigation Strategies

Implementing Rate Limiting: Introducing delays between requests can reduce server load and mitigate the risk of being throttled or blocked.
Advanced Parsing Techniques: Incorporating tools like Selenium or headless browsers can enhance the crawler's ability to handle JavaScript-rendered content.
Robust File Management: Developing a systematic approach to file naming and storage can prevent conflicts and improve data organization.

(C) Selenium-Based Web Crawler Utilizing Browser Automation

Programming Language and Environment

Language: Python
Environment: Server-side execution with browser automation
Technologies Used: selenium (with Firefox WebDriver), BeautifulSoup, requests, threading

Architecture and Design

Browser Automation: Leverages Selenium WebDriver to automate interactions with web pages, including navigating through pages and executing JavaScript functions.
Session Management: Manages user interactions programmatically, enabling the crawler to handle authentication and navigation tasks seamlessly.
Dynamic Content Handling: Utilizes Selenium to interact with JavaScript-heavy websites, enabling access to dynamically loaded content.
Multithreading for Operations: Implements threading to concurrently handle multiple tasks, enhancing efficiency and reducing overall processing time.
Error Handling and Resilience: Incorporates try-except blocks and conditional checks to manage unexpected scenarios, such as missing elements or failed operations.

Advantages

Dynamic Content Interaction: Selenium's ability to control a real browser allows the crawler to execute JavaScript, interact with dynamic content, and navigate complex web interfaces effectively.
Comprehensive Data Collection: Facilitates the retrieval of content that is loaded dynamically, ensuring more complete data collection from modern websites.
Automation of Complex Tasks: Capable of automating intricate user interactions, such as form submissions and multi-step navigation processes, enhancing the crawler's versatility.

Limitations

Resource Intensity: Running a full browser instance consumes substantial system resources, including CPU and memory, which can limit scalability and affect host system performance.
Maintenance and Stability: Selenium-based crawlers require ongoing maintenance to accommodate changes in website structures and dynamic content rendering, increasing the complexity of upkeep.
Handling Anti-Bot Measures: Websites may implement measures to detect and block automated bots, such as CAPTCHAs or behavior analysis, posing challenges to the crawler's effectiveness.
Thread Management and Synchronization: Managing multiple threads for concurrent operations introduces complexity, including potential race conditions and synchronization issues, necessitating careful programming to ensure thread safety.

Potential Mitigation Strategies

Headless Browser Configuration: Running browsers in headless mode can reduce resource consumption while maintaining functionality.
Dynamic Selector Updates: Implementing automated or semi-automated methods to update element selectors can enhance the crawler's resilience to website changes.
Advanced Anti-Bot Evasion Techniques: Incorporating strategies such as randomized user-agent strings, varying interaction patterns, and introducing delays can help bypass anti-bot measures.
Efficient Thread Management: Utilizing thread pools and synchronization primitives can optimize thread usage and prevent common multithreading issues.

Overview of the JavaScript-Based Web Crawler and Image Downloader Prototype

Project nGene.org has developed a prototype of a JavaScript-based web crawler and image downloader intended to automate the collection and analysis of web-based biomedical data. This client-side crawler operates within a web browser, enabling the input of a target website URL, specification of the depth of recursion, selection of specific HTML tags to search for, and the decision to limit crawling to the same domain. The following outlines the functionality of this prototype, the challenges encountered—particularly regarding Cross-Origin Resource Sharing (CORS) policies—and its inherent limitations, along with potential strategies to overcome these obstacles.

Web Crawler and Image Downloader

Crawl within the same domain only

Download All Images

(A) Functionality and Features

User Inputs:
- URL Input: Allows entry of the website URL intended for crawling.
- Depth Specification: Enables setting the depth of recursion, where a depth of 0 indicates no following of links beyond the initial page.
- HTML Tag Selection: Facilitates the specification of HTML tags to target, with the default being the anchor tag (<a>).
- Domain Restriction: Provides an option to restrict crawling within the same domain, preventing navigation to external websites.
Crawling Process:
- Fetching Pages: Retrieves the HTML content of the specified URL.
- Parsing HTML: Parses the HTML to identify all elements matching the specified tag.
- Image Extraction and Downloading: Identifies image sources (<img> tags), constructs absolute URLs for these images, and initiates downloads.
- Recursion Control: Based on the specified depth, the crawler can recursively follow links to additional pages within the same domain.
Control Mechanisms:
- Start and Stop Buttons: Allows initiation and termination of the crawling process at any time.
- Progress Indicators: Provides real-time feedback on the crawling status, including errors encountered during the process.

(B) Issues and Limitations

While the JavaScript-based crawler offers a convenient and accessible means of data collection directly from the browser, several significant challenges are encountered:

1. Cross-Origin Resource Sharing (CORS) Policy

CORS is a security feature implemented by web browsers to restrict web pages from making requests to a different domain than the one that served the web page. This ensures that malicious websites cannot access sensitive data from other sites without permission.

Impact on the Crawler: Since the crawler operates within the browser, it is subject to the same CORS restrictions. Attempts to fetch resources (such as HTML pages or images) from different origins will be blocked unless the target server explicitly permits them via appropriate CORS headers.
Example Scenario: Crawling and downloading images from a website that does not set the necessary CORS headers will result in failed requests, preventing access and saving of those images.

2. Same-Origin Policy Constraints

The same-origin policy is a security measure that allows scripts running on a web page to interact only with resources from the same origin (i.e., same domain, protocol, and port). This restricts the crawler from accessing and processing content from external websites unless they are within the same domain or have been configured to allow such interactions.

3. Performance and Scalability

Browser Limitations: Browsers are not optimized for intensive crawling tasks. As the depth of recursion increases, the number of concurrent requests and the volume of data processed can lead to significant slowdowns or even browser crashes.
Resource Consumption: Extensive crawling can consume considerable CPU and memory resources, affecting both the crawler's performance and the overall responsiveness of the browser.

4. Handling Dynamic and JavaScript-Heavy Websites

Dynamic Content Loading: Many modern websites use JavaScript to load content dynamically after the initial page load. The current crawler implementation parses only the static HTML fetched via the fetch API and does not execute or render JavaScript.
Consequences: As a result, any content loaded dynamically may be missed, leading to incomplete data collection. This limitation is particularly problematic for websites that rely heavily on client-side rendering.

(C) CORS Policy Issues and Their Implications

The enforcement of CORS policies presents a significant barrier to the crawler's effectiveness:

Incomplete Data Access: Without proper CORS headers from target websites, the crawler cannot access external resources, limiting its ability to collect comprehensive data and images.
Increased Error Handling: The crawler must handle and log errors resulting from blocked requests, which can complicate the crawling process and reduce overall efficiency.
Restricted Scope: The inability to bypass CORS restrictions confines the crawler's operations primarily to the same domain, reducing its utility for projects requiring data from multiple sources.

(D) Limitations of the Current Implementation

Restricted Access to External Domains: The crawler cannot traverse beyond the initial domain unless the external sites permit cross-origin requests, severely limiting data collection capabilities.
Inability to Execute JavaScript: Since the crawler does not execute JavaScript, it cannot access or interact with dynamically loaded content, resulting in incomplete data retrieval from modern websites.
Browser Performance Constraints: Intensive crawling tasks can degrade browser performance, leading to slowdowns or crashes, especially when handling large volumes of data or deep recursion levels.
Security Concerns: Running a crawler within the browser may expose it to security vulnerabilities, especially if it interacts with untrusted websites or handles sensitive data.

(E) Circumventing CORS and Overcoming Limitations

While CORS policies and other limitations present significant challenges, several strategies can mitigate these issues:

Using a CORS Proxy:
- Definition: A CORS proxy acts as an intermediary between the crawler and the target website, adding the necessary CORS headers to responses.
- Benefits: By routing requests through a CORS proxy, the crawler can bypass browser-enforced CORS restrictions, enabling access to external resources.
- Considerations: Public CORS proxies may have usage limitations or introduce latency. For large-scale or frequent crawling, setting up a dedicated CORS proxy server is advisable.
Server-Side Crawling:
- Approach: Shifting the crawling process to a server-side environment (e.g., using Node.js) bypasses browser-imposed CORS restrictions.
- Advantages: Server-side crawlers are not subject to CORS policies, can handle larger-scale data collection, and can execute JavaScript if needed (using headless browsers like Puppeteer).
- Implementation: Developing a backend service that performs crawling tasks and communicates results to the client-side application.
Leveraging Browser Extensions:
- Strategy: Creating a browser extension with elevated permissions can allow the crawler to access cross-origin resources by modifying request headers.
- Limitations: Developing and distributing browser extensions requires additional effort and may introduce security risks if not properly managed.
Using Headless Browsers:
- Tools: Headless browsers like Puppeteer or Selenium can execute JavaScript, interact with dynamic content, and bypass some CORS restrictions by controlling browser behavior programmatically.
- Benefits: Enhanced capability to handle complex websites and dynamic content, providing more comprehensive data collection.
- Drawbacks: Requires running the crawler outside the standard browser environment, involving more complex setup and resource management.

Implementing Rate Limiting and Throttling:

Purpose: To mitigate performance issues and reduce the risk of overloading target websites, strict rate limiting and request throttling can be implemented.
Method: Introduce delays between requests and limit the number of concurrent fetch operations.

Example Implementation:

async function crawl(url, depth, tag, sameDomainOnly, visited = new Set(), failed = new Set(), baseDomain = null) {
   if (stopCrawling || depth < 0 || visited.has(url) || failed.has(url)) return;

   visited.add(url);

   // ... existing code ...

   // Introduce a delay between requests
   await new Promise(resolve => setTimeout(resolve, 1000)); // 1-second delay

   // ... continue crawling ...
}

Enhancing Error Handling:
- Robust Error Logging: Improve error handling to gracefully manage CORS-related failures and provide meaningful feedback to users.
- Retry Mechanisms: Implement retry logic for transient errors, possibly using exponential backoff strategies to manage repeated request failures.

Conclusion

The JavaScript-based web crawler and image downloader prototype integrated into Project nGene.org offers a user-friendly interface for automated data collection directly within the browser. However, significant challenges related to browser security policies, particularly CORS, and inherent limitations in handling dynamic content and maintaining performance are encountered. By adopting strategies such as using CORS proxies, shifting to server-side crawling, leveraging headless browsers, and implementing robust rate limiting, these limitations can be effectively mitigated. These enhancements will enable the prototype to perform more comprehensive and efficient data collection, thereby supporting the mission to advance hemodynamic research through accurate and extensive biomedical data aggregation.

Crawler JavaScript Source Code with Detailed Comments

Firefox Extension

Developing a Firefox Extension for Automatic Media Downloading

The ability to automatically download images and videos from webpages can enhance productivity and user experience. Implementing this functionality in Firefox can be approached in two primary ways: modifying Firefox's source code or developing a browser extension. This document provides an integrated overview of these methods, focusing on the creation of a Firefox extension due to its practicality and ease of maintenance.

Approaches to Implementing Automatic Media Downloading in Firefox

Modifying Firefox Source Code

Modifying the Firefox source code involves directly editing the browser's internal components to include the desired functionality. While this approach offers deep integration and control, it presents significant challenges:

Complexity and Maintenance: Altering the source code requires extensive knowledge of Firefox's architecture and programming languages like C++, Rust, and JavaScript. Maintaining a custom version necessitates regular updates and recompilation to stay aligned with official releases.
Development Environment Setup: Setting up a development environment for Firefox involves downloading large codebases, managing dependencies, and understanding build processes, which can be time-consuming and error-prone.

Developing a Firefox Extension

Creating a Firefox extension, specifically a WebExtension, is a more practical solution. Extensions are easier to develop, maintain, and distribute. They operate within the browser's existing framework, providing the desired functionality without altering the core code.

Recommended Approach: Developing a Firefox Extension

Programming Languages Used

Firefox extensions utilize standard web technologies, making development accessible:

JavaScript: Core logic and interaction with Firefox APIs.
HTML: Structure of the extension's user interface (if needed).
CSS: Styling for any UI components.
JSON: Configuration and metadata through the manifest.json file.

Overview of Extension Development

Setting Up the Development Environment:
- Install Firefox Developer Edition for advanced debugging features.
- Enable "Developer Mode" in Firefox by navigating to about:debugging.
Creating Essential Files:
- manifest.json: Defines metadata, permissions, and scripts.
- Background Script: Handles media detection and download initiation.
- Content Script: Interacts with webpages to collect media URLs.
Implementing Functionality:
- Media Detection: The content script scans webpages for visible images and videos, collecting their source URLs.
- Automatic Downloading: The background script receives media URLs and uses the Downloads API to save files to the default download directory.
Testing and Deployment:
- Load the extension temporarily in Firefox for testing.
- Optionally, package and publish the extension on Mozilla's Add-ons site for wider distribution.

Functionality of the Extension

How It Works

The extension operates by:

Accessing the Webpage DOM: It runs in the context of the webpage, accessing visible media elements.
Using Standard Browser APIs: Downloads are initiated through standard mechanisms, similar to manual user actions.
Maintaining User Session: When logged into websites, the extension benefits from the existing authenticated session, allowing access to media available to the user.

Handling Logged-In Websites

The extension can download content from websites requiring authentication because:

Session Context: It shares the browser's session cookies and authentication tokens.
Visible Content Access: It only interacts with content that the user can already see and access.

Possible limitations include:

Dynamic Content Loading: Additional scripting may be necessary to handle content loaded asynchronously.
Protected or Encrypted Media: Content with DRM or encryption cannot be downloaded using standard methods.

Considerations Regarding Website Detection

Potential Detection Methods

Websites might detect automated downloading through:

Unusual Traffic Patterns: Rapid or bulk downloads may resemble bot activity.
Request Headers: Inconsistent or atypical headers could raise suspicion.
Resource Access: Attempting to download protected or inaccessible content.

Best Practices to Minimize Detection

Implement Throttling: Introduce delays between downloads to mimic human behavior.
Limit Download Scope: Focus on downloading only visible and relevant media.
Respect Website Policies: Adhere to terms of service and robots.txt guidelines.
Avoid Header Manipulation: Maintain standard request headers to prevent anomalies.
Prevent Server Overload: Avoid excessive simultaneous downloads.

Limitations and Legal Considerations

Platforms with DRM Protections

Websites like YouTube, Netflix, and other streaming services employ DRM technologies that prevent the downloading of their content. The extension:

Cannot Bypass DRM: It is unable to download encrypted or protected media streams.
Is Limited to Accessible Content: Works with media that is directly accessible and not protected by DRM.

Ethical and Legal Concerns

Users should be mindful of:

Copyright Laws: Downloading copyrighted material without permission may infringe on intellectual property rights.
Terms of Service: Violating a website's terms may lead to account suspension or legal action.
Responsible Use: The extension should be used ethically, respecting content ownership and legal restrictions.

Written on November 29th, 2024

Automated Downloading in Firefox Extensions: Minimizing Detection and Ensuring Ethical Use (Written November 30, 2024)

Automated downloading and web scraping can inadvertently trigger detection mechanisms on websites, potentially resulting in blocks or other restrictions. Implementing best practices helps minimize the risk of detection while ensuring responsible and ethical use of automated tools within Firefox extensions. The strategies outlined below provide guidance on emulating human-like behavior, respecting website policies, and preventing server overload.

1. Throttling Downloads

Introducing delays between download requests is essential for mimicking human behavior. Randomized delays make automated activities less distinguishable from those of regular users.

Use Delays Between Requests: Introduce a pause between each download request to avoid rapid succession that could raise suspicion.
Randomize Delay Intervals: Utilize random time intervals to simulate natural user actions.

function startDownloadWithDelay(item, delay) {
    setTimeout(() => {
        startDownload(item);
    }, delay);
}

// Use a random delay between 1 to 3 seconds
const randomDelay = Math.random() * 2000 + 1000; // 1000 to 3000 ms
startDownloadWithDelay(item, randomDelay);

In this example, startDownloadWithDelay introduces a delay before initiating the download. The delay is randomized between 1 to 3 seconds to prevent patterns that might be detected by automated systems.

2. Limiting Download Scope

Focusing on downloading only visible and relevant media reduces the volume of requests and aligns with typical user behavior.

Download Visible Media Only: Limit downloads to media elements currently visible in the viewport.
Use Viewport-Based Filters: Employ functions that check whether an element is within the user's current view.

function isElementInViewport(el) {
    const rect = el.getBoundingClientRect();
    return (
        rect.top >= 0 &&
        rect.left >= 0 &&
        rect.bottom <= (window.innerHeight || document.documentElement.clientHeight) &&
        rect.right <= (window.innerWidth || document.documentElement.clientWidth)
    );
}

The isElementInViewport function determines if a media element is within the visible area of the webpage. By downloading only these elements, the automation mimics typical user interaction with the page.

3. Respecting Website Policies

Adhering to a website's policies and guidelines is essential for ethical automation practices. The robots.txt file provides directives on how automated agents should interact with the site.

Retrieve and Parse robots.txt: Access the robots.txt file to understand the allowed and disallowed paths for automated agents.
Comply with Disallowed Directives: Ensure that the automation does not access or download content from paths specified under Disallow.

Accessing `robots.txt`

Locate the File: Navigate to https://example.com/robots.txt, replacing example.com with the target domain.
Parse the Content: Analyze the file to identify any restrictions applicable to automated downloading.

Example `robots.txt` Content

User-agent: *
Disallow: /private/

In this example, all user agents are instructed not to access the /private/ directory. Automated tools should respect this directive to comply with the website's policies.

4. Avoiding Header Manipulation

Maintaining standard request headers helps prevent anomalies that might trigger detection systems. Custom headers or unusual values can raise red flags.

Use Default Headers: Keep request headers consistent with those sent by standard browsers.
Avoid Unnecessary Modifications: Do not alter headers like Referer or User-Agent unless necessary for functionality.

By adhering to standard header configurations, automated requests appear more like those from regular users, reducing the likelihood of detection.

5. Preventing Server Overload

Excessive simultaneous downloads can strain server resources and negatively impact website performance. Limiting concurrency ensures responsible use of resources.

Limit Concurrent Downloads: Set a maximum number of downloads that can occur simultaneously.
Implement a Queue System: Process downloads incrementally using a queue to manage active requests.

let activeDownloads = 0;
const maxConcurrentDownloads = 5;
const downloadQueue = [];

function processQueue() {
    if (activeDownloads < maxConcurrentDownloads && downloadQueue.length > 0) {
        const item = downloadQueue.shift();
        activeDownloads++;
        startDownload(item, () => {
            activeDownloads--;
            processQueue();
        });
    }
}

// Add items to the queue and start processing
downloadQueue.push(...itemsToDownload);
processQueue();

In this code, processQueue manages the download queue by ensuring that no more than five downloads occur at the same time. The startDownload function includes the logic for downloading the item and invokes a callback upon completion.

Written on November 30th, 2024

nGeneAutomaticDownloader: Firefox Extension Documentation (Written November 30, 2024 V1.5)

This document provides a comprehensive explanation of the five scripts used in the nGeneAutomaticDownloader Firefox extension. Each section includes the full script with detailed comments and an explanation of how functions and features are implemented to facilitate easier understanding and maintenance.

Section 1: `manifest.json`

{
  "manifest_version": 2,  // Specifies the version of the manifest file format
  "name": "nGeneAutomaticDownloader",  // The name of the extension
  "version": "1.5",  // The version of the extension
  "description": "Automatically downloads all images and videos from webpages to a 'firefox' folder in your default download directory.",  // A brief description
  "permissions": [
    "downloads",  // Allows use of the downloads API to download files
    "tabs",  // Grants access to browser tabs
    "<all_urls>",  // Allows access to all URLs
    "storage",  // Permits storage and retrieval of data using chrome.storage API
    "webRequest",  // Enables observation and analysis of web requests
    "webRequestBlocking"  // Allows modification or blocking of web requests
  ],
  "background": {
    "scripts": ["background.js"]  // Specifies the background script
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],  // The content script will be injected into all pages
      "exclude_matches": ["about:*", "resource://*/*"],  // Excludes internal browser pages
      "js": ["content.js"],  // The content script file
      "run_at": "document_idle"  // Injects the script after the page has loaded
    }
  ],
  "browser_action": {
    "default_title": "nGeneAutomaticDownloader",  // Tooltip text for the browser action icon
    "default_popup": "options.html",  // HTML file displayed when the icon is clicked
    "default_icon": {
      "48": "icons/download-icon.png"  // Icon for the browser action
    }
  },
  "options_ui": {
    "page": "options.html",  // Options page for the extension
    "open_in_tab": false  // Opens the options page as a popup
  },
  "icons": {
    "48": "icons/download-icon.png"  // The extension's icon
  }
}

The manifest.json file is the configuration file for the Firefox extension. It defines essential metadata and specifies the extension's behavior.

manifest_version: Indicates the manifest file format version. Version 2 is used for older extensions.
name: The extension's name displayed in the browser's extension list.
version: The extension's version number, useful for updates.
description: A brief summary of the extension's functionality.
permissions:
- downloads: Allows the extension to download files using the downloads API.
- tabs: Grants access to browser tabs.
- <all_urls>: Permits the extension to access all URLs.
- storage: Enables data storage and retrieval using the chrome.storage API.
- webRequest: Allows the extension to observe and analyze web requests.
- webRequestBlocking: Permits the extension to modify or block web requests.
background: Specifies the background script (background.js) that runs in the background context.
content_scripts:
- matches: Defines URLs where the content script (content.js) will be injected. Here, it matches all URLs.
- exclude_matches: Excludes specific internal browser pages from script injection.
- run_at: Sets the injection timing to after the page has loaded.
browser_action:
- default_title: Tooltip text for the browser action icon.
- default_popup: The HTML file (options.html) displayed when the icon is clicked.
- default_icon: The icon for the browser action.
options_ui:
- page: The options page (options.html) for the extension.
- open_in_tab: Determines whether the options page opens in a new tab or as a popup.
icons: Specifies the extension's icon.

Section 2: `content.js`

(function () {
  // Set to keep track of processed media URLs to prevent duplicates
  const processedMediaUrls = new Set();

  // Main function to process media elements starting from a root node
  function processMediaElements(rootNode) {
    const mediaUrls = []; // Array to collect media URLs found

    // If the root node is not an element or the document itself, exit
    if (rootNode.nodeType !== Node.ELEMENT_NODE && rootNode !== document) {
      return;
    }

    // Nodes to process; start with the root node
    const nodes = rootNode === document ? [document] : [rootNode];

    // Iterate over each node to collect media URLs
    nodes.forEach((node) => {
      // Collect images from <img> tags
      node.querySelectorAll('img').forEach((img) => {
        collectImageFromElement(img, mediaUrls);
      });

      // Collect images from <picture> elements
      node.querySelectorAll('picture source').forEach((source) => {
        collectSrcsetUrls(source, mediaUrls);
      });

      // Collect videos and their source elements
      node.querySelectorAll('video, source').forEach((element) => {
        collectVideoFromElement(element, mediaUrls);
      });

      // Collect images from <object> and <embed> tags
      node.querySelectorAll('object, embed').forEach((element) => {
        collectObjectEmbedMedia(element, mediaUrls);
      });

      // Collect background images from CSS stylesheets
      collectBackgroundImages(mediaUrls);

      // Collect images from inline styles
      collectInlineStyles(mediaUrls);

      // Collect images from <canvas> elements
      node.querySelectorAll('canvas').forEach((canvas) => {
        collectCanvasImage(canvas);
      });

      // Collect images from pseudo-elements (::before and ::after)
      collectPseudoElementImages(mediaUrls);
    });

    // Process the collected media URLs
    processMediaUrls(mediaUrls);
  }

  // Collects image URLs from <img> elements
  function collectImageFromElement(img, mediaUrls) {
    let url = img.src || img.currentSrc; // Get the image source URL
    if (!url) {
      // Check for lazy-loaded images
      url = img.getAttribute('data-src') || img.getAttribute('data-lazy-src');
    }
    if (url) {
      mediaUrls.push(url); // Add the URL to the list
    }
    // Handle srcset attribute for responsive images
    const srcset = img.getAttribute('srcset');
    if (srcset) {
      const srcsetUrls = srcset
        .split(',')
        .map((entry) => entry.trim().split(' ')[0]);
      srcsetUrls.forEach((srcsetUrl) => {
        if (srcsetUrl) {
          mediaUrls.push(srcsetUrl);
        }
      });
    }
  }

  // Collects image URLs from <source> elements in <picture> tags
  function collectSrcsetUrls(element, mediaUrls) {
    const srcset = element.getAttribute('srcset');
    if (srcset) {
      const srcsetUrls = srcset
        .split(',')
        .map((entry) => entry.trim().split(' ')[0]);
      srcsetUrls.forEach((srcsetUrl) => {
        if (srcsetUrl) {
          mediaUrls.push(srcsetUrl);
        }
      });
    }
    // Check for 'src' attribute
    const src = element.getAttribute('src');
    if (src) {
      mediaUrls.push(src);
    }
  }

  // Extracts images from <canvas> elements
  function collectCanvasImage(canvas) {
    try {
      // Convert the canvas content to a data URL
      const dataURL = canvas.toDataURL();
      if (dataURL && !processedMediaUrls.has(dataURL)) {
        processedMediaUrls.add(dataURL); // Mark as processed
        // Send a message to download the data URL
        chrome.runtime.sendMessage(
          { type: 'downloadDataUrl', dataUrl: dataURL },
          function (response) {
            if (chrome.runtime.lastError) {
              console.error(
                `Error sending canvas image: ${chrome.runtime.lastError}`
              );
            }
          }
        );
      }
    } catch (e) {
      console.error('Failed to extract image from canvas:', e);
    }
  }

  // Collects video URLs from <video> and <source> elements
  function collectVideoFromElement(element, mediaUrls) {
    if (element.tagName.toLowerCase() === 'video') {
      let url = element.src || element.currentSrc; // Get the video source URL
      if (!url) {
        url = element.getAttribute('data-src');
      }
      if (url) {
        mediaUrls.push(url);
      }
      // Process <source> elements within the <video>
      element.querySelectorAll('source').forEach((sourceElement) => {
        const sourceUrl =
          sourceElement.src ||
          sourceElement.getAttribute('src') ||
          sourceElement.getAttribute('data-src');
        if (sourceUrl) {
          mediaUrls.push(sourceUrl);
        }
      });
      // Check for 'poster' attribute
      const posterUrl = element.getAttribute('poster');
      if (posterUrl) {
        mediaUrls.push(posterUrl);
      }
    } else if (element.tagName.toLowerCase() === 'source') {
      // For <source> elements outside of <video>
      const sourceUrl =
        element.src ||
        element.getAttribute('src') ||
        element.getAttribute('data-src');
      if (sourceUrl) {
        mediaUrls.push(sourceUrl);
      }
    }
  }

  // Collects media URLs from <object> and <embed> elements
  function collectObjectEmbedMedia(element, mediaUrls) {
    const url = element.data || element.getAttribute('data');
    if (url) {
      mediaUrls.push(url);
    }
  }

  // Collects background images from CSS stylesheets
  function collectBackgroundImages(mediaUrls) {
    for (const sheet of document.styleSheets) {
      let rules;
      try {
        rules = sheet.cssRules; // Get CSS rules
      } catch (e) {
        // Skip cross-origin stylesheets
        continue;
      }

      if (!rules) continue;

      for (const rule of rules) {
        if (rule.type === CSSRule.STYLE_RULE) {
          const style = rule.style;
          const bgImage =
            style.getPropertyValue('background-image') ||
            style.getPropertyValue('background');
          extractUrlsFromStyle(bgImage, mediaUrls);
        } else if (rule.type === CSSRule.MEDIA_RULE) {
          // Handle @media rules
          for (const mediaRule of rule.cssRules) {
            if (mediaRule.type === CSSRule.STYLE_RULE) {
              const style = mediaRule.style;
              const bgImage =
                style.getPropertyValue('background-image') ||
                style.getPropertyValue('background');
              extractUrlsFromStyle(bgImage, mediaUrls);
            }
          }
        }
      }
    }
  }

  // Collects background images from inline styles
  function collectInlineStyles(mediaUrls) {
    document.querySelectorAll('*[style]').forEach((element) => {
      const style = element.getAttribute('style');
      extractUrlsFromStyle(style, mediaUrls);
    });
  }

  // Collects images from pseudo-elements (::before and ::after)
  function collectPseudoElementImages(mediaUrls) {
    document.querySelectorAll('*').forEach((element) => {
      ['::before', '::after'].forEach((pseudo) => {
        const style = getComputedStyle(element, pseudo);
        const bgImage =
          style.getPropertyValue('background-image') ||
          style.getPropertyValue('background');
        extractUrlsFromStyle(bgImage, mediaUrls);
      });
    });
  }

  // Extracts URLs from CSS style properties
  function extractUrlsFromStyle(styleValue, mediaUrls) {
    if (styleValue && styleValue !== 'none') {
      // Match URLs in the style value
      const urls = styleValue.match(/url\(["']?([^"')]+)["']?\)/g);
      if (urls) {
        urls.forEach((urlString) => {
          const url = urlString.match(/url\(["']?([^"')]+)["']?\)/)[1];
          if (url) {
            // Resolve relative URLs to absolute URLs
            const absoluteUrl = new URL(url, location.href).href;
            mediaUrls.push(absoluteUrl);
          }
        });
      }
    }
  }

  // Processes the collected media URLs
  function processMediaUrls(mediaUrls) {
    // Remove duplicates and already processed URLs
    const uniqueUrls = Array.from(new Set(mediaUrls));
    uniqueUrls.forEach((url) => {
      const cleanUrl = url.split('#')[0]; // Remove fragment identifiers

      if (processedMediaUrls.has(cleanUrl)) {
        return; // Skip already processed URLs
      }

      processedMediaUrls.add(cleanUrl); // Mark as processed

      // Handle data URLs
      if (url.startsWith('data:')) {
        // Send a message to download the data URL
        chrome.runtime.sendMessage(
          { type: 'downloadDataUrl', dataUrl: url },
          function (response) {
            if (chrome.runtime.lastError) {
              console.error(
                `Error sending data URL: ${chrome.runtime.lastError}`
              );
            }
          }
        );
        return;
      }

      let filename;
      try {
        const urlObj = new URL(url, location.href); // Create a URL object
        filename = urlObj.pathname.split('/').pop(); // Extract the filename
        if (!filename || filename.length === 0) {
          filename = 'unnamed'; // Default filename
        }
        // Try to get file extension
        let extension = filename.includes('.') ? filename.split('.').pop() : '';
        if (!extension) {
          // Guess extension from MIME type
          const mimeType = urlObj.searchParams.get('type') || '';
          if (mimeType) {
            extension = mimeType.split('/').pop();
          }
        }
        if (extension) {
          filename += '.' + extension;
        }
      } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return;
      }

      // Send a message to download the file
      chrome.runtime.sendMessage(
        { type: 'download', url: url, filename: filename },
        function (response) {
          if (chrome.runtime.lastError) {
            console.error(
              `Error sending message for ${filename}: ${chrome.runtime.lastError}`
            );
          }
        }
      );
    });
  }

  // Observes changes in the DOM to detect new media elements
  const observer = new MutationObserver((mutations) => {
    mutations.forEach((mutation) => {
      if (mutation.type === 'childList') {
        // If nodes are added to the DOM
        mutation.addedNodes.forEach((node) => {
          if (node.nodeType === Node.ELEMENT_NODE) {
            processMediaElements(node); // Process the new node
            // Also process media elements within this node
            node
              .querySelectorAll(
                'img, video, source, picture source, object, embed, canvas'
              )
              .forEach((element) => {
                processMediaElements(element);
              });
          }
        });
      } else if (mutation.type === 'attributes') {
        // If attributes of an element have changed
        if (mutation.target && mutation.target.nodeType === Node.ELEMENT_NODE) {
          // Check if the changed attribute is relevant
          const relevantAttributes = [
            'src',
            'srcset',
            'style',
            'data-src',
            'data-lazy-src',
            'poster',
            'data',
            'href',
          ];
          if (relevantAttributes.includes(mutation.attributeName)) {
            processMediaElements(mutation.target); // Process the element
          }
        }
      }
    });
  });

  // Start observing the document for changes
  observer.observe(document, {
    childList: true, // Observe when nodes are added or removed
    subtree: true, // Observe all descendant nodes
    attributes: true, // Observe attribute changes
    attributeFilter: [
      'src',
      'srcset',
      'style',
      'data-src',
      'data-lazy-src',
      'poster',
      'data',
      'href',
    ], // Attributes to observe
  });

  // Listen for user interactions to trigger processing
  ['click', 'scroll', 'mousemove', 'touchstart', 'touchmove'].forEach(
    (event) => {
      document.addEventListener(event, () => {
        processMediaElements(document);
      });
    }
  );

  // Initial processing when the window loads
  window.addEventListener('load', () => {
    processMediaElements(document);
  });
})();

The content.js script is a content script that runs in the context of web pages. Its primary purpose is to identify and collect all media elements (images, videos, etc.) on a web page and send messages to the background script to download these media files.

IIFE (Immediately Invoked Function Expression): The entire script is wrapped in an IIFE to avoid polluting the global namespace.
processedMediaUrls: A Set used to keep track of media URLs that have already been processed, preventing duplicate downloads.
processMediaElements(rootNode): The main function that processes media elements starting from a given root node. It collects media URLs from various elements such as <img>, <picture>, <video>, <source>, <object>, <embed>, <canvas>, and also from CSS styles and pseudo-elements.
Collecting Media URLs:
- Images:
  - collectImageFromElement(img, mediaUrls): Collects image URLs from <img> elements, handling src, currentSrc, data-src, data-lazy-src, and srcset attributes.
  - collectSrcsetUrls(element, mediaUrls): Collects image URLs from <source> elements within <picture> tags, processing srcset and src attributes.
- Videos:
  - collectVideoFromElement(element, mediaUrls): Collects video URLs from <video> and <source> elements, handling src, currentSrc, data-src, and poster attributes.
- Other Media:
  - collectObjectEmbedMedia(element, mediaUrls): Collects media URLs from <object> and <embed> elements.
- Canvas:
  - collectCanvasImage(canvas): Extracts images from <canvas> elements by converting the canvas content to a data URL.
- CSS Background Images:
  - collectBackgroundImages(mediaUrls): Collects background images specified in CSS stylesheets.
  - collectInlineStyles(mediaUrls): Collects background images from inline style attributes.
  - collectPseudoElementImages(mediaUrls): Collects images from CSS pseudo-elements ::before and ::after.
extractUrlsFromStyle(styleValue, mediaUrls): A helper function that extracts URLs from CSS style properties using regular expressions.
Processing Collected Media URLs:
- processMediaUrls(mediaUrls): Processes the collected media URLs by removing duplicates and already processed URLs. It handles data URLs and regular URLs differently:
  1. Data URLs: Sends a message to the background script to download them.
  2. Regular URLs: Extracts the filename, sanitizes it, and sends a download request to the background script.
Mutation Observer:
- observer: Observes changes in the DOM, such as the addition of new nodes or changes to attributes, to detect dynamically loaded media.
User Interaction Event Listeners: Listens for events like click, scroll, mousemove, touchstart, and touchmove to trigger reprocessing of the page, capturing media that loads upon user interaction.
Initial Processing: Performs initial media processing when the window loads.

Section 3: `background.js`

// Variables to manage downloads and settings
let downloadQueue = []; // Queue for download requests
let activeDownloads = 0; // Number of active downloads
let maxConcurrentDownloads = 10; // Default maximum concurrent downloads
let keepTrack = true; // Whether to keep track of downloaded filenames
let minFileSize = 50 * 1024; // Minimum file size in bytes (default 50 KB)
let downloadedFilenames = new Set(); // Set to store filenames of downloaded files
let downloadedUrls = new Set(); // Set to store URLs of downloaded files

// Load initial settings from chrome.storage.local
chrome.storage.local.get(
  [
    'threads',
    'keepTrack',
    'minFileSize',
    'downloadedFilenames',
    'downloadedUrls',
  ],
  (result) => {
    // Update variables with saved settings or use defaults
    maxConcurrentDownloads = result.threads || 10;
    keepTrack = result.keepTrack !== false; // Default to true if undefined
    minFileSize = (result.minFileSize || 50) * 1024; // Convert KB to bytes
    if (result.downloadedFilenames && Array.isArray(result.downloadedFilenames)) {
      downloadedFilenames = new Set(result.downloadedFilenames); // Initialize set with saved filenames
    }
    if (result.downloadedUrls && Array.isArray(result.downloadedUrls)) {
      downloadedUrls = new Set(result.downloadedUrls); // Initialize set with saved URLs
    }
  }
);

// Listener for messages from other parts of the extension
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === 'updateSettings') {
    // Update settings when they are changed in options
    chrome.storage.local.get(
      ['threads', 'keepTrack', 'minFileSize'],
      (result) => {
        maxConcurrentDownloads = result.threads || 10;
        keepTrack = result.keepTrack !== false;
        minFileSize = (result.minFileSize || 50) * 1024;
        console.log(
          `Updated settings: maxConcurrentDownloads = ${maxConcurrentDownloads}, keepTrack = ${keepTrack}, minFileSize = ${minFileSize} bytes`
        );
      }
    );
  } else if (message.type === 'resetQueue') {
    // Reset the download queue
    downloadQueue = [];
    sendResponse({ status: 'success' });
  } else if (message.type === 'clearHistory') {
    // Clear the set of downloaded filenames and URLs
    downloadedFilenames.clear();
    downloadedUrls.clear();
    chrome.storage.local.set(
      { downloadedFilenames: [], downloadedUrls: [] },
      () => {
        sendResponse({ status: 'success' });
      }
    );
    return true; // Keep the message channel open for sendResponse
  } else if (message.type === 'download') {
    // Handle download request from content script
    const url = message.url;
    const filename = message.filename;

    // Check if the file has already been downloaded
    if (keepTrack && downloadedUrls.has(url)) {
      console.log(
        `Skipping download for URL ${url} as it has already been downloaded.`
      );
      sendResponse({ status: 'skipped' });
      return;
    }

    // Add the download request to the queue
    downloadQueue.push({ url: url, filename: filename });

    // Start processing the queue
    processQueue();

    sendResponse({ status: 'queued' });
  } else if (message.type === 'downloadDataUrl') {
    // Handle download request for data URL
    const dataUrl = message.dataUrl;
    const filename = `image_${Date.now()}.png`; // Generate a unique filename

    // Convert data URL to Blob
    fetch(dataUrl)
      .then((res) => res.blob())
      .then((blob) => {
        const url = URL.createObjectURL(blob);
        downloadQueue.push({ url: url, filename: filename, revokeUrl: true });
        processQueue();
        sendResponse({ status: 'queued' });
      })
      .catch((error) => {
        console.error(`Failed to download data URL: ${error}`);
        sendResponse({ status: 'error', error: error.toString() });
      });
    return true; // Keep the message channel open for sendResponse
  }
});

// Function to process the download queue
function processQueue() {
  // Continue processing while there are slots for active downloads
  while (activeDownloads < maxConcurrentDownloads && downloadQueue.length > 0) {
    const item = downloadQueue.shift(); // Get the next item from the queue
    startDownload(item); // Start the download
  }
}

// Function to start a download
function startDownload(item) {
  activeDownloads++; // Increment the number of active downloads

  const filename = item.filename || item.url.split('/').pop();
  const fullFilename = `firefox/${sanitizeFilename(filename)}`; // Prepend 'firefox/' to create a subdirectory

  chrome.downloads.download(
    {
      url: item.url,
      filename: fullFilename,
      saveAs: false, // Do not prompt the user for save location
      conflictAction: 'overwrite', // Overwrite existing files
    },
    (downloadId) => {
      if (chrome.runtime.lastError) {
        // Handle errors during download initiation
        console.error(
          `Download failed for ${filename}: ${chrome.runtime.lastError}`
        );
        activeDownloads--; // Decrement active downloads
        processQueue(); // Try the next item in the queue
      } else {
        console.log(
          `Download started: ID = ${downloadId}, filename = ${filename}`
        );

        // Listener for changes in the download state
        function onChanged(delta) {
          if (
            delta.id === downloadId &&
            delta.state &&
            delta.state.current === 'complete'
          ) {
            // Download completed successfully
            chrome.downloads.search({ id: downloadId }, function (items) {
              if (items && items.length > 0) {
                const downloadItem = items[0];
                const fileSize =
                  downloadItem.fileSize || downloadItem.totalBytes; // Get the file size
                if (fileSize < minFileSize) {
                  // File is smaller than the minimum size
                  chrome.downloads.removeFile(downloadId, function () {
                    if (chrome.runtime.lastError) {
                      console.error(
                        `Failed to remove file: ${chrome.runtime.lastError}`
                      );
                    } else {
                      console.log(
                        `Removed file ${filename} (size ${fileSize} bytes) because it is smaller than the minimum size (${minFileSize} bytes).`
                      );
                    }
                    // Remove the download from history
                    chrome.downloads.erase({ id: downloadId });
                  });
                } else {
                  // File meets the size requirement
                  if (keepTrack) {
                    // Add the filename and URL to the sets
                    downloadedFilenames.add(filename);
                    downloadedUrls.add(item.url);
                    // Update the stored sets
                    chrome.storage.local.set({
                      downloadedFilenames: Array.from(downloadedFilenames),
                      downloadedUrls: Array.from(downloadedUrls),
                    });
                  }
                }
              }
            });

            // Cleanup after download completion
            chrome.downloads.onChanged.removeListener(onChanged);
            activeDownloads--; // Decrement active downloads
            processQueue(); // Process the next item

            // Revoke object URL if necessary
            if (item.revokeUrl) {
              URL.revokeObjectURL(item.url);
            }
          } else if (
            delta.id === downloadId &&
            delta.state &&
            delta.state.current === 'interrupted'
          ) {
            // Download was interrupted
            console.error(`Download interrupted for ${filename}`);
            chrome.downloads.onChanged.removeListener(onChanged);
            activeDownloads--; // Decrement active downloads
            processQueue(); // Process the next item

            // Revoke object URL if necessary
            if (item.revokeUrl) {
              URL.revokeObjectURL(item.url);
            }
          }
        }

        // Add the listener to monitor download changes
        chrome.downloads.onChanged.addListener(onChanged);
      }
    }
  );
}

// Function to sanitize filenames
function sanitizeFilename(filename) {
  return filename.replace(/[\\/:*?"<>|]/g, '_');
}

// Use the webRequest API to monitor network requests
chrome.webRequest.onCompleted.addListener(
  (details) => {
    // Check if the request is for an image or video
    if (details.type === 'image' || details.type === 'media') {
      const url = details.url;

      // Extract filename from URL
      let filename;
      try {
        const urlObj = new URL(url);
        filename = urlObj.pathname.split('/').pop();
        if (!filename || filename.length === 0) {
          filename = 'unnamed';
        }
      } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return;
      }

      filename = filename.split('?')[0];

      // Sanitize filename
      filename = sanitizeFilename(filename);

      // Check if the file has already been downloaded
      if (keepTrack && downloadedUrls.has(url)) {
        console.log(
          `Skipping download for URL ${url} as it has already been downloaded.`
        );
        return;
      }

      // Add the download request to the queue
      downloadQueue.push({ url: url, filename: filename });

      // Start processing the queue
      processQueue();
    }
  },
  { urls: ['<all_urls>'] },
  []
);

The background.js script runs in the background context of the extension. It manages the downloading of media files, handles settings, and communicates with the content script and options page.

Variables:
- downloadQueue: An array serving as a queue for download requests.
- activeDownloads: Tracks the number of downloads currently in progress.
- maxConcurrentDownloads: Maximum number of concurrent downloads allowed.
- keepTrack: Indicates whether to track downloaded filenames and URLs to avoid duplicates.
- minFileSize: The minimum file size (in bytes) required for a file to be kept.
- downloadedFilenames and downloadedUrls: Sets to store filenames and URLs of downloaded files.
Loading Initial Settings: Retrieves saved settings and updates variables accordingly.
Message Listener: Listens for messages from other parts of the extension and handles different message types:
1. 'updateSettings': Reloads settings when changed.
2. 'resetQueue': Resets the download queue.
3. 'clearHistory': Clears the history of downloaded filenames and URLs.
4. 'download': Handles download requests from the content script.
5. 'downloadDataUrl': Handles download requests for data URLs.
processQueue(): Processes the download queue while there are available slots for active downloads.
startDownload(item): Initiates the download of an item and sets up listeners to monitor the download state.
sanitizeFilename(filename): Replaces invalid characters in filenames.
Monitoring Network Requests: Uses the webRequest API to monitor completed web requests and adds relevant media URLs to the download queue.

Section 4: `options.js`

// Wait until the DOM content is fully loaded
document.addEventListener('DOMContentLoaded', () => {
  // References to the HTML elements in options.html
  const threadsInput = document.getElementById('threads'); // Input for concurrent downloads
  const keepTrackCheckbox = document.getElementById('keepTrack'); // Checkbox for tracking filenames
  const minFileSizeInput = document.getElementById('minFileSize'); // Input for minimum file size
  const saveButton = document.getElementById('save'); // Button to save settings
  const resetButton = document.getElementById('reset'); // Button to reset the download queue
  const clearHistoryButton = document.getElementById('clearHistory'); // Button to clear history

  // Load saved settings from chrome.storage.local
  chrome.storage.local.get(['threads', 'keepTrack', 'minFileSize'], (result) => {
    // Set input values to saved settings or defaults
    threadsInput.value = result.threads || 10; // Default to 10 threads
    keepTrackCheckbox.checked = result.keepTrack !== false; // Default to true
    minFileSizeInput.value = result.minFileSize || 50; // Default to 50 KB
  });

  // Event listener for the Save Settings button
  saveButton.addEventListener('click', () => {
    // Retrieve values from the inputs
    const threads = parseInt(threadsInput.value) || 10;
    const keepTrack = keepTrackCheckbox.checked;
    const minFileSize = parseInt(minFileSizeInput.value) || 50;

    // Save the settings to chrome.storage.local
    chrome.storage.local.set(
      { threads: threads, keepTrack: keepTrack, minFileSize: minFileSize },
      () => {
        alert('Settings saved.');
        // Notify the background script to update settings
        chrome.runtime.sendMessage({ type: 'updateSettings' });
      }
    );
  });

  // Event listener for the Reset Download Queue button
  resetButton.addEventListener('click', () => {
    // Send a message to reset the queue
    chrome.runtime.sendMessage({ type: 'resetQueue' }, (response) => {
      if (response.status === 'success') {
        alert('Download queue reset.');
      } else {
        alert('Failed to reset queue.');
      }
    });
  });

  // Event listener for the Clear Downloaded Filenames button
  clearHistoryButton.addEventListener('click', () => {
    // Send a message to clear the history
    chrome.runtime.sendMessage({ type: 'clearHistory' }, (response) => {
      if (response.status === 'success') {
        alert('Downloaded filenames history cleared.');
      } else {
        alert('Failed to clear history.');
      }
    });
  });
});

The options.js script handles the user interface of the extension's options page, allowing users to adjust settings and perform actions.

DOM Content Loaded Event: Ensures that the script runs after the page's content has fully loaded.
References to HTML Elements:
- threadsInput: Input field for the number of concurrent downloads.
- keepTrackCheckbox: Checkbox for enabling/disabling tracking of downloaded filenames.
- minFileSizeInput: Input field for the minimum file size.
- saveButton: Button to save settings.
- resetButton: Button to reset the download queue.
- clearHistoryButton: Button to clear the download history.
Loading Saved Settings: Retrieves saved settings from chrome.storage.local and updates the input fields.
Event Listeners:
1. Save Settings: Saves the user's settings and notifies the background script.
2. Reset Download Queue: Sends a message to reset the download queue and alerts the user.
3. Clear Downloaded Filenames: Sends a message to clear the download history and alerts the user.

Section 5: `options.html`

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>nGeneAutomaticDownloader Options</title>
  <style>
    /* Basic styling for the options page */
    body { font-family: Arial, sans-serif; padding: 10px; }
    label { display: block; margin-bottom: 5px; }
    input[type="number"] { width: 50px; }
    button { margin-top: 10px; margin-right: 10px; }
  </style>
</head>
<body>
  <h2>Extension Settings</h2>

  <!-- Setting for the number of concurrent downloads -->
  <label>
    Number of Concurrent Downloads:
    <input type="number" id="threads" min="1" max="10">
  </label>

  <!-- Setting to keep track of downloaded filenames -->
  <label>
    Keep track of downloaded filenames:
    <input type="checkbox" id="keepTrack">
  </label>

  <!-- Setting for the minimum file size threshold -->
  <label>
    Minimum File Size (KB):
    <input type="number" id="minFileSize" min="0" step="1">
  </label>

  <!-- Buttons to save settings, reset download queue, and clear history -->
  <button id="save">Save Settings</button>
  <button id="reset">Reset Download Queue</button>
  <button id="clearHistory">Clear Downloaded Filenames</button>

  <!-- Include the options.js script -->
  <script src="options.js"></script>
</body>
</html>

The options.html file defines the user interface for the extension's options page.

HTML Structure:
- Head Section:
  - meta charset: Specifies the character encoding.
  - title: Sets the page title.
  - style: Contains basic CSS styling for the page.
- Body:
  - h2: "Extension Settings" heading.
  - Settings Form:
    - Number of Concurrent Downloads: Input field with id="threads".
    - Keep Track of Downloaded Filenames: Checkbox with id="keepTrack".
    - Minimum File Size (KB): Input field with id="minFileSize".
  - Action Buttons:
    - Save Settings: Button with id="save".
    - Reset Download Queue: Button with id="reset".
    - Clear Downloaded Filenames: Button with id="clearHistory".
- Script Inclusion: Includes the options.js script for interactivity.
Styling: Basic CSS styling is applied to ensure a clean and user-friendly interface.

Written on November 30th, 2024

nGeneAutomaticDownloader Extension v1.6 (Written November 30, 2024)

The nGeneAutomaticDownloader is a Firefox extension designed to automatically download all images and videos from webpages, organizing them into a designated 'firefox' folder within the user's default download directory. This document provides a comprehensive overview of the extension's implementation across its constituent files. It elucidates the functionalities and features embedded within each script, offering insights into the mechanisms employed for thread management, download tracking, file type handling, and user interface configuration. This overview serves as a reference to facilitate future reviews and modifications of the extension's codebase.

Code Listings

manifest.json

{
  "manifest_version": 2,
  "name": "nGeneAutomaticDownloader",
  "version": "1.6",
  "description": "Automatically downloads all images and videos from webpages to a 'firefox' folder in your default download directory.",
  "permissions": [
    "downloads",
    "tabs",
    "<all_urls>",
    "storage",
    "webRequest",
    "webRequestBlocking"
  ],
  "background": {
    "scripts": ["background.js"]
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "exclude_matches": ["about:*", "resource://*/*"],
      "js": ["content.js"],
      "run_at": "document_idle"
    }
  ],
  "browser_action": {
    "default_title": "nGeneAutomaticDownloader",
    "default_popup": "options.html",
    "default_icon": {
      "48": "icons/download-icon.png"
    }
  },
  "options_ui": {
    "page": "options.html",
    "open_in_tab": false
  },
  "icons": {
    "48": "icons/download-icon.png"
  }
}

content.js

(function () {
  // Set to keep track of processed media URLs to prevent duplicates
  const processedMediaUrls = new Set();

  // Main function to process media elements starting from a root node
  function processMediaElements(rootNode) {
    const mediaUrls = [];

    if (!rootNode) return;

    // Use a Set to avoid processing the same node multiple times
    const nodesToProcess = new Set();

    // Collect nodes to process
    function collectNodes(node) {
      if (node.nodeType !== Node.ELEMENT_NODE) return;

      nodesToProcess.add(node);

      // Recursively collect child nodes
      node.querySelectorAll('*').forEach((child) => {
        nodesToProcess.add(child);
      });
    }

    collectNodes(rootNode);

    // Process each node
    nodesToProcess.forEach((node) => {
      // Process media elements based on their tag names
      const tagName = node.tagName.toLowerCase();

      if (tagName === 'img') {
        collectImageFromElement(node, mediaUrls);
      } else if (tagName === 'video' || tagName === 'audio') {
        collectMediaFromElement(node, mediaUrls);
      } else if (tagName === 'source') {
        collectSourceFromElement(node, mediaUrls);
      } else if (tagName === 'picture') {
        collectPictureSources(node, mediaUrls);
      } else if (tagName === 'object' || tagName === 'embed') {
        collectObjectEmbedMedia(node, mediaUrls);
      } else if (tagName === 'canvas') {
        collectCanvasImage(node);
      }
    });

    // Collect background images from styles
    collectBackgroundImages(mediaUrls);
    collectInlineStyles(mediaUrls);
    collectPseudoElementImages(mediaUrls);

    // Process the collected media URLs
    processMediaUrls(mediaUrls);
  }

  // Collect image URLs from <img> elements
  function collectImageFromElement(img, mediaUrls) {
    const urls = [];

    // src attribute
    if (img.src) {
      urls.push(img.src);
    }

    // data-src or data-lazy-src attributes for lazy-loaded images
    const dataSrc = img.getAttribute('data-src') || img.getAttribute('data-lazy-src');
    if (dataSrc) {
      urls.push(dataSrc);
    }

    // srcset attribute
    const srcset = img.getAttribute('srcset');
    if (srcset) {
      const srcsetUrls = srcset
        .split(',')
        .map((entry) => entry.trim().split(' ')[0])
        .filter((url) => url);
      urls.push(...srcsetUrls);
    }

    // Add collected URLs to mediaUrls
    mediaUrls.push(...urls);
  }

  // Collect media URLs from <video> and <audio> elements
  function collectMediaFromElement(mediaElement, mediaUrls) {
    const urls = [];

    // src attribute
    if (mediaElement.src) {
      urls.push(mediaElement.src);
    }

    // data-src attribute
    const dataSrc = mediaElement.getAttribute('data-src');
    if (dataSrc) {
      urls.push(dataSrc);
    }

    // Poster attribute (for videos)
    const poster = mediaElement.getAttribute('poster');
    if (poster) {
      urls.push(poster);
    }

    // Collect from child <source> elements
    mediaElement.querySelectorAll('source').forEach((source) => {
      collectSourceFromElement(source, urls);
    });

    // Add collected URLs to mediaUrls
    mediaUrls.push(...urls);
  }

  // Collect URLs from <source> elements
  function collectSourceFromElement(sourceElement, mediaUrls) {
    const urls = [];

    // src attribute
    const src = sourceElement.src || sourceElement.getAttribute('src');
    if (src) {
      urls.push(src);
    }

    // data-src attribute
    const dataSrc = sourceElement.getAttribute('data-src');
    if (dataSrc) {
      urls.push(dataSrc);
    }

    // srcset attribute
    const srcset = sourceElement.getAttribute('srcset');
    if (srcset) {
      const srcsetUrls = srcset
        .split(',')
        .map((entry) => entry.trim().split(' ')[0])
        .filter((url) => url);
      urls.push(...srcsetUrls);
    }

    // Add collected URLs to mediaUrls
    mediaUrls.push(...urls);
  }

  // Collect sources from <picture> elements
  function collectPictureSources(pictureElement, mediaUrls) {
    pictureElement.querySelectorAll('source').forEach((source) => {
      collectSourceFromElement(source, mediaUrls);
    });
  }

  // Collect media from <object> and <embed> elements
  function collectObjectEmbedMedia(element, mediaUrls) {
    const data = element.getAttribute('data');
    if (data) {
      mediaUrls.push(data);
    }
    const src = element.getAttribute('src');
    if (src) {
      mediaUrls.push(src);
    }
  }

  // Collect images from <canvas> elements
  function collectCanvasImage(canvas) {
    try {
      const dataURL = canvas.toDataURL();
      if (dataURL && !processedMediaUrls.has(dataURL)) {
        processedMediaUrls.add(dataURL);
        chrome.runtime.sendMessage(
          { type: 'downloadDataUrl', dataUrl: dataURL },
          function (response) {
            if (chrome.runtime.lastError) {
              console.error(`Error sending canvas image: ${chrome.runtime.lastError}`);
            }
          }
        );
      }
    } catch (e) {
      console.error('Failed to extract image from canvas:', e);
    }
  }

  // Collect background images from CSS stylesheets
  function collectBackgroundImages(mediaUrls) {
    for (const sheet of document.styleSheets) {
      let rules;
      try {
        rules = sheet.cssRules;
      } catch (e) {
        continue; // Skip cross-origin stylesheets
      }

      if (!rules) continue;

      for (const rule of rules) {
        if (rule.type === CSSRule.STYLE_RULE) {
          const style = rule.style;
          const bgImage = style.getPropertyValue('background-image') || style.getPropertyValue('background');
          extractUrlsFromStyle(bgImage, mediaUrls);
        } else if (rule.type === CSSRule.MEDIA_RULE) {
          for (const mediaRule of rule.cssRules) {
            if (mediaRule.type === CSSRule.STYLE_RULE) {
              const style = mediaRule.style;
              const bgImage = style.getPropertyValue('background-image') || style.getPropertyValue('background');
              extractUrlsFromStyle(bgImage, mediaUrls);
            }
          }
        }
      }
    }
  }

  // Collect background images from inline styles
  function collectInlineStyles(mediaUrls) {
    document.querySelectorAll('*[style]').forEach((element) => {
      const style = element.getAttribute('style');
      extractUrlsFromStyle(style, mediaUrls);
    });
  }

  // Collect images from pseudo-elements
  function collectPseudoElementImages(mediaUrls) {
    document.querySelectorAll('*').forEach((element) => {
      ['::before', '::after'].forEach((pseudo) => {
        const style = getComputedStyle(element, pseudo);
        const bgImage = style.getPropertyValue('background-image') || style.getPropertyValue('background');
        extractUrlsFromStyle(bgImage, mediaUrls);
      });
    });
  }

  // Extract URLs from CSS style properties
  function extractUrlsFromStyle(styleValue, mediaUrls) {
    if (styleValue && styleValue !== 'none') {
      const urls = styleValue.match(/url\\(["']?([^"')]+)["']?\\)/g);
      if (urls) {
        urls.forEach((urlString) => {
          const url = urlString.match(/url\\(["']?([^"')]+)["']?\\)/)[1];
          if (url) {
            const absoluteUrl = new URL(url, location.href).href;
            mediaUrls.push(absoluteUrl);
          }
        });
      }
    }
  }

  // Process collected media URLs
  function processMediaUrls(mediaUrls) {
    const uniqueUrls = Array.from(new Set(mediaUrls));

    uniqueUrls.forEach((url) => {
      const cleanUrl = url.split('#')[0];

      if (processedMediaUrls.has(cleanUrl)) {
        return;
      }

      processedMediaUrls.add(cleanUrl);

      // Handle data URLs
      if (url.startsWith('data:')) {
        chrome.runtime.sendMessage(
          { type: 'downloadDataUrl', dataUrl: url },
          function (response) {
            if (chrome.runtime.lastError) {
              console.error(`Error sending data URL: ${chrome.runtime.lastError}`);
            }
          }
        );
        return;
      }

      let filename;
      try {
        const urlObj = new URL(url, location.href);
        filename = urlObj.pathname.split('/').pop() || 'unnamed';
      } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return;
      }

      // Send message to background script to download the file
      chrome.runtime.sendMessage(
        { type: 'download', url: url, filename: filename },
        function (response) {
          if (chrome.runtime.lastError) {
            console.error(`Error sending message for ${filename}: ${chrome.runtime.lastError}`);
          }
        }
      );
    });
  }

  // Enhance MutationObserver to detect attribute changes and added nodes
  const observer = new MutationObserver((mutations) => {
    mutations.forEach((mutation) => {
      if (mutation.type === 'childList') {
        // Process added nodes
        mutation.addedNodes.forEach((node) => {
          processMediaElements(node);
        });
      } else if (mutation.type === 'attributes') {
        processMediaElements(mutation.target);
      }
    });
  });

  // Start observing the document for changes
  observer.observe(document, {
    childList: true,
    subtree: true,
    attributes: true,
    attributeFilter: [
      'src',
      'srcset',
      'data-src',
      'data-lazy-src',
      'poster',
      'style',
      'data',
      'href',
    ],
  });

  // Initial processing
  processMediaElements(document);

  // Re-process periodically to catch any missed elements
  setInterval(() => {
    processMediaElements(document);
  }, 5000); // Adjust the interval as needed
})();

background.js

// Variables to manage downloads and settings
let downloadQueue = [];
let activeDownloads = 0;
let maxConcurrentDownloads = 10;
let keepTrack = true;
let minFileSize = 50 * 1024;
let downloadedFilenames = new Set();
let downloadedUrls = new Set();

// Load initial settings from storage
chrome.storage.local.get(
  [
    'threads',
    'keepTrack',
    'minFileSize',
    'downloadedFilenames',
    'downloadedUrls',
  ],
  (result) => {
    maxConcurrentDownloads = result.threads || 10;
    keepTrack = result.keepTrack !== false;
    minFileSize = (result.minFileSize || 50) * 1024;
    if (result.downloadedFilenames && Array.isArray(result.downloadedFilenames)) {
      downloadedFilenames = new Set(result.downloadedFilenames);
    }
    if (result.downloadedUrls && Array.isArray(result.downloadedUrls)) {
      downloadedUrls = new Set(result.downloadedUrls);
    }
  }
);

// Listener for messages from content scripts
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === 'updateSettings') {
    // Update settings
    chrome.storage.local.get(['threads', 'keepTrack', 'minFileSize'], (result) => {
      maxConcurrentDownloads = result.threads || 10;
      keepTrack = result.keepTrack !== false;
      minFileSize = (result.minFileSize || 50) * 1024;
      console.log(
        `Updated settings: maxConcurrentDownloads = ${maxConcurrentDownloads}, keepTrack = ${keepTrack}, minFileSize = ${minFileSize} bytes`
      );
    });
  } else if (message.type === 'resetQueue') {
    downloadQueue = [];
    sendResponse({ status: 'success' });
  } else if (message.type === 'clearHistory') {
    downloadedFilenames.clear();
    downloadedUrls.clear();
    chrome.storage.local.set(
      { downloadedFilenames: [], downloadedUrls: [] },
      () => {
        sendResponse({ status: 'success' });
      }
    );
    return true;
  } else if (message.type === 'download') {
    const url = message.url;
    const filename = message.filename;

    if (keepTrack && downloadedUrls.has(url)) {
      console.log(`Skipping download for URL ${url} as it has already been downloaded.`);
      sendResponse({ status: 'skipped' });
      return;
    }

    downloadQueue.push({ url: url, filename: filename });
    processQueue();
    sendResponse({ status: 'queued' });
  } else if (message.type === 'downloadDataUrl') {
    const dataUrl = message.dataUrl;
    const filename = `image_${Date.now()}.png`;

    fetch(dataUrl)
      .then((res) => res.blob())
      .then((blob) => {
        const url = URL.createObjectURL(blob);
        downloadQueue.push({ url: url, filename: filename, revokeUrl: true });
        processQueue();
        sendResponse({ status: 'queued' });
      })
      .catch((error) => {
        console.error(`Failed to download data URL: ${error}`);
        sendResponse({ status: 'error', error: error.toString() });
      });
    return true;
  }
});

// Function to process the download queue
function processQueue() {
  while (activeDownloads < maxConcurrentDownloads && downloadQueue.length > 0) {
    const item = downloadQueue.shift();
    startDownload(item);
  }
}

// Function to start a download
function startDownload(item) {
  activeDownloads++;

  const filename = item.filename || item.url.split('/').pop();
  const fullFilename = `firefox/${sanitizeFilename(filename)}`;

  chrome.downloads.download(
    {
      url: item.url,
      filename: fullFilename,
      saveAs: false,
      conflictAction: 'overwrite',
    },
    (downloadId) => {
      if (chrome.runtime.lastError) {
        console.error(`Download failed for ${filename}: ${chrome.runtime.lastError}`);
        activeDownloads--;
        processQueue();
      } else {
        console.log(`Download started: ID = ${downloadId}, filename = ${filename}`);

        // Listener for changes in the download state
        function onChanged(delta) {
          if (
            delta.id === downloadId &&
            delta.state &&
            delta.state.current === 'complete'
          ) {
            chrome.downloads.search({ id: downloadId }, function (items) {
              if (items && items.length > 0) {
                const downloadItem = items[0];
                const fileSize = downloadItem.fileSize || downloadItem.totalBytes;
                if (fileSize < minFileSize) {
                  chrome.downloads.removeFile(downloadId, function () {
                    if (chrome.runtime.lastError) {
                      console.error(`Failed to remove file: ${chrome.runtime.lastError}`);
                    } else {
                      console.log(
                        `Removed file ${filename} (size ${fileSize} bytes) because it is smaller than the minimum size (${minFileSize} bytes).`
                      );
                    }
                    chrome.downloads.erase({ id: downloadId });
                  });
                } else {
                  if (keepTrack) {
                    downloadedFilenames.add(filename);
                    downloadedUrls.add(item.url);
                    chrome.storage.local.set({
                      downloadedFilenames: Array.from(downloadedFilenames),
                      downloadedUrls: Array.from(downloadedUrls),
                    });
                  }
                }
              }
            });

            chrome.downloads.onChanged.removeListener(onChanged);
            activeDownloads--;
            processQueue();

            if (item.revokeUrl) {
              URL.revokeObjectURL(item.url);
            }
          } else if (
            delta.id === downloadId &&
            delta.state &&
            delta.state.current === 'interrupted'
          ) {
            console.error(`Download interrupted for ${filename}`);
            chrome.downloads.onChanged.removeListener(onChanged);
            activeDownloads--;
            processQueue();

            if (item.revokeUrl) {
              URL.revokeObjectURL(item.url);
            }
          }
        }

        chrome.downloads.onChanged.addListener(onChanged);
      }
    }
  );
}

// Function to sanitize filenames
function sanitizeFilename(filename) {
  return filename.replace(/[\\/:*?"<>|]/g, '_');
}

// Modify webRequest listener to capture media requests
chrome.webRequest.onCompleted.addListener(
  (details) => {
    const url = details.url;

    // Check if the request is for an image or video
    if (details.type === 'image' || details.type === 'media') {
      if (keepTrack && downloadedUrls.has(url)) {
        console.log(`Skipping download for URL ${url} as it has already been downloaded.`);
        return;
      }

      // Extract filename from URL
      let filename;
      try {
        const urlObj = new URL(url);
        filename = urlObj.pathname.split('/').pop() || 'unnamed';
      } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return;
      }

      filename = sanitizeFilename(filename);

      // Add the download request to the queue
      downloadQueue.push({ url: url, filename: filename });
      processQueue();
    }
  },
  { urls: ['<all_urls>'] },
  []
);

options.js

document.addEventListener('DOMContentLoaded', () => {
  const threadsInput = document.getElementById('threads');
  const keepTrackCheckbox = document.getElementById('keepTrack');
  const minFileSizeInput = document.getElementById('minFileSize');
  const saveButton = document.getElementById('save');
  const resetButton = document.getElementById('reset');
  const clearHistoryButton = document.getElementById('clearHistory');

  // Load saved settings
  chrome.storage.local.get(['threads', 'keepTrack', 'minFileSize'], (result) => {
    threadsInput.value = result.threads || 10;
    keepTrackCheckbox.checked = result.keepTrack !== false;
    minFileSizeInput.value = result.minFileSize || 50;
  });

  // Save settings
  saveButton.addEventListener('click', () => {
    const threads = parseInt(threadsInput.value) || 10;
    const keepTrack = keepTrackCheckbox.checked;
    const minFileSize = parseInt(minFileSizeInput.value) || 50;

    chrome.storage.local.set(
      { threads: threads, keepTrack: keepTrack, minFileSize: minFileSize },
      () => {
        alert('Settings saved.');
        chrome.runtime.sendMessage({ type: 'updateSettings' });
      }
    );
  });

  // Reset download queue
  resetButton.addEventListener('click', () => {
    chrome.runtime.sendMessage({ type: 'resetQueue' }, (response) => {
      if (response.status === 'success') {
        alert('Download queue reset.');
      } else {
        alert('Failed to reset queue.');
      }
    });
  });

  // Clear download history
  clearHistoryButton.addEventListener('click', () => {
    chrome.runtime.sendMessage({ type: 'clearHistory' }, (response) => {
      if (response.status === 'success') {
        alert('Downloaded filenames history cleared.');
      } else {
        alert('Failed to clear history.');
      }
    });
  });
});

options.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>nGeneAutomaticDownloader Options</title>
  <style>
    body { font-family: Arial, sans-serif; padding: 10px; }
    label { display: block; margin-bottom: 5px; }
    input[type="number"] { width: 50px; }
    button { margin-top: 10px; margin-right: 10px; }
  </style>
</head>
<body>
  <h2>Extension Settings</h2>

  <label>
    Number of Concurrent Downloads:
    <input type="number" id="threads" min="1" max="10">
  </label>

  <label>
    Keep track of downloaded filenames:
    <input type="checkbox" id="keepTrack">
  </label>

  <label>
    Minimum File Size (KB):
    <input type="number" id="minFileSize" min="0" step="1">
  </label>

  <button id="save">Save Settings</button>
  <button id="reset">Reset Download Queue</button>
  <button id="clearHistory">Clear Downloaded Filenames</button>

  <script src="options.js"></script>
</body>
</html>

Feature Implementation

1. Manifest Configuration (manifest.json)

The manifest.json file serves as the blueprint for the Firefox extension, delineating its metadata, permissions, and the scripts it employs.

Permissions: The extension requests access to downloads, tabs, <all_urls>, storage, webRequest, and webRequestBlocking. These permissions enable the extension to monitor and interact with web content, manage downloads, and store user settings.
Background Script: The background.js script is specified under the background property, functioning as the extension's core controller for managing downloads and maintaining state.
Content Scripts: The content.js script is injected into all webpages (<all_urls>) except for about:* and resource://*/* URLs. It operates at the document_idle stage, ensuring that the DOM is fully loaded before execution.
Browser Action and Options UI: The extension includes a browser action with an associated popup (options.html) and an icon. The options_ui section configures the settings page, allowing users to adjust preferences without opening a new tab.

2. Content Script Functionality (content.js)

The content.js script is responsible for identifying and extracting media elements (images and videos) from the webpages the user visits. Its implementation encompasses several key features:

Media Detection: The script scans the DOM for various media-related elements, including <img>, <video>, <audio>, <source>, <picture>, <object>, <embed>, and <canvas>. It also inspects CSS styles for background images and pseudo-elements (::before, ::after).
URL Extraction: For each detected media element, the script extracts relevant URLs from attributes such as src, data-src, srcset, and inline styles. It ensures that only unique and valid URLs are processed to prevent duplicate downloads.
Canvas Handling: The script attempts to extract images from <canvas> elements by converting their content to data URLs. If successful, it sends these data URLs to the background script for downloading.
Mutation Observation: Utilizing a MutationObserver, the script monitors the DOM for changes, such as the addition of new media elements or modifications to existing ones. This ensures that dynamically loaded content is also captured and processed.
Periodic Processing: To account for any elements that may have been missed by the observer, the script periodically re-processes the entire document at specified intervals.
Communication with Background Script: Upon identifying media URLs, the script communicates with the background.js script via messages, requesting downloads and passing necessary information like URLs and filenames.

3. Background Script Operations (background.js)

The background.js script orchestrates the downloading process, managing download queues, threads, and tracking mechanisms.

Download Queue Management: The script maintains a downloadQueue array to hold pending download requests. It processes this queue based on the number of active downloads and the user-defined maximum concurrent downloads (maxConcurrentDownloads).
Thread Control: The script tracks the number of active downloads using the activeDownloads variable. It ensures that no more than the specified number of concurrent downloads are active at any given time, thereby managing threading effectively.
Settings Integration: User preferences, such as the number of concurrent threads, whether to keep track of downloaded files, and the minimum file size for downloads, are retrieved from chrome.storage.local. These settings influence how the script processes and prioritizes download tasks.
Download Tracking: To prevent redundant downloads, the script maintains two Set objects: downloadedFilenames and downloadedUrls. These sets store the names and URLs of files that have already been downloaded, ensuring that the same file is not downloaded multiple times.
Download Initiation and Monitoring: The script initiates downloads using the chrome.downloads.download API. It listens for changes in the download state to determine when a download is complete or interrupted. Upon completion, it verifies the file size against the minimum threshold and removes files that do not meet this criterion. Successfully downloaded files are recorded in the tracking sets.
Data URL Handling: For media elements represented as data URLs (e.g., images from <canvas> elements), the script fetches the blob data and converts it into a downloadable URL. These URLs are then added to the download queue for processing.
Web Request Listener: The script employs a webRequest.onCompleted listener to capture media requests directly from network traffic. This complements the DOM-based detection in content.js, ensuring comprehensive media coverage.

4. User Interface and Settings Management (options.js and options.html)

The options.html and options.js files collectively provide a user interface for configuring the extension's settings.

Settings Fields:
- Number of Concurrent Downloads: Users can specify the maximum number of simultaneous downloads, allowing for control over bandwidth usage and system resources.
- Download Tracking: A checkbox enables users to toggle the tracking of downloaded filenames, aiding in preventing duplicate downloads.
- Minimum File Size: Users can set a threshold for the minimum file size (in KB) to be downloaded, ensuring that only substantial media files are saved.
Interactivity:
- Save Settings: Upon adjusting settings, users can save their preferences, which are then stored in chrome.storage.local. The background script is notified to update its configuration accordingly.
- Reset Download Queue: This feature allows users to clear the current download queue, halting any pending download operations.
- Clear Download History: Users can erase the history of downloaded filenames and URLs, resetting the extension's tracking mechanism.
Persistence: The settings are loaded upon the options page's initialization, ensuring that user preferences persist across sessions.

5. Thread Management and Concurrency Control

Thread management is pivotal in handling multiple downloads efficiently without overwhelming system resources.

Initialization: Upon startup, the background script retrieves user-defined settings, including the number of concurrent threads (maxConcurrentDownloads), from storage.
Processing Logic: The processQueue function oversees the download queue, initiating downloads as long as the number of active downloads is below the specified limit. This function is invoked whenever a new download is added to the queue or when an active download completes.
Download Lifecycle: Each download task increments the activeDownloads count upon initiation. Listeners monitor the download's progress, decrementing the count and triggering the processing of subsequent queued downloads upon completion or interruption.

6. Download Tracking Mechanism

To avoid redundant downloads and optimize performance, the extension implements a robust tracking system.

Sets for Tracking: Two Set objects, downloadedFilenames and downloadedUrls, store the names and URLs of files that have been successfully downloaded.
Condition Checks: Before initiating a download, the script checks whether the URL exists in the downloadedUrls set. If it does, the download is skipped to prevent duplication.
Updating Tracking Data: Upon successful completion of a download that meets the minimum file size requirement, the filename and URL are added to their respective sets. These sets are then persisted to storage, ensuring that the tracking data remains consistent across sessions.
Clearing History: Users can reset the tracking data via the options interface, allowing for fresh download sessions without residual tracking information.

7. File Type Handling

The extension is tailored to download specific media file types, primarily images and videos.

Detection: Both the content script (content.js) and the background script (background.js) focus on media elements and web requests associated with images (<img>, background images) and videos (<video>, <audio>, <source>, etc.).
Data URL Processing: For media represented as data URLs, such as images drawn on <canvas> elements, the content script extracts the data and forwards it to the background script for conversion and downloading.
File Validation: Post-download, the background script assesses the file size to ensure it meets the user-defined minimum threshold. Files falling below this size are automatically removed, maintaining the integrity of the download collection.

8. User Interface Design

The extension's user interface is designed for simplicity and ease of use.

Layout and Styling: The options.html file employs basic CSS to structure the settings form, ensuring that users can intuitively navigate and adjust preferences.
Feedback Mechanisms: Actions such as saving settings, resetting the download queue, and clearing history are accompanied by alert messages, providing immediate feedback to the user.
Accessibility: Input fields are appropriately labeled, and buttons are clearly named to facilitate straightforward interaction.

Written on November 30th, 2024