Project nGene.org is advanced academic software designed to facilitate programming and research in the field of hemodynamics, integrating computational modeling, simulation, medical statistics, and machine learning. As part of this approach, Project nGene.org employs web crawling (web scraping) to aggregate and analyze biomedical data from various online sources. This section describes the ethical framework guiding the web crawling activities within Project nGene.org, so that data collection practices align with legal standards and the project's academic integrity.
By using User-Agent strings that include contact information, the project ensures that its presence is identifiable and accountable. This transparency allows website administrators and stakeholders to recognize the source of data requests and facilitates open communication channels.

By using a User-Agent string that includes contact information, Project nGene.org ensures transparency in its web crawling operations.

Project nGene.org has developed three distinct web crawler prototypes, each utilizing different programming languages and methodologies. These prototypes serve as foundational tools for automated data collection and analysis, essential for advancing research objectives. This analysis describes the programming characteristics, advantages, and limitations of each version. These implementations are in the prototype stage and are primarily designed for testing and evaluation purposes.
Project nGene.org has developed a prototype of a JavaScript-based web crawler and image downloader intended to automate the collection and analysis of web-based biomedical data. This client-side crawler operates within a web browser, enabling the input of a target website URL, specification of the depth of recursion, selection of specific HTML tags to search for, and the decision to limit crawling to the same domain. The following outlines the functionality of this prototype, the challenges encountered—particularly regarding Cross-Origin Resource Sharing (CORS) policies—and its inherent limitations, along with potential strategies to overcome these obstacles.
<a>
).<img>
tags), constructs absolute URLs for these images, and initiates downloads.

While the JavaScript-based crawler offers a convenient and accessible means of data collection directly from the browser, it encounters several significant challenges:
CORS is a mechanism, enforced by web browsers, through which a server declares which other origins may read its responses. Without that permission, the browser prevents a page's scripts from reading responses from a different origin than the one that served the page. This ensures that malicious websites cannot access sensitive data from other sites without permission.
The same-origin policy is a security measure that allows scripts running on a web page to interact only with resources from the same origin (i.e., same domain, protocol, and port). This restricts the crawler from accessing and processing content from external websites unless they are within the same domain or have been configured to allow such interactions.
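As a rough illustration of the rule described above, the following sketch compares two URLs by protocol, host, and port using the standard URL API. The helper name isSameOrigin is hypothetical and is not part of the prototype; it only shows which requests a browser would treat as cross-origin.

```javascript
// Hypothetical helper: true when both URLs share protocol, host, and port.
function isSameOrigin(urlA, urlB) {
  try {
    return new URL(urlA).origin === new URL(urlB).origin;
  } catch (e) {
    // Malformed or relative URLs cannot be compared; treat them as cross-origin.
    return false;
  }
}

console.log(isSameOrigin('https://example.com/a', 'https://example.com/b'));    // same origin
console.log(isSameOrigin('https://example.com/', 'http://example.com/'));       // protocol differs
console.log(isSameOrigin('https://example.com/', 'https://example.com:8443/')); // port differs
```

A crawler running in the page can use a check like this to predict which fetches will be subject to CORS before attempting them.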
The crawler relies on the fetch API and does not execute or render JavaScript.

The enforcement of CORS policies presents a significant barrier to the crawler's effectiveness:
While CORS policies and other limitations present significant challenges, several strategies can mitigate these issues:
async function crawl(url, depth, tag, sameDomainOnly, visited = new Set(), failed = new Set(), baseDomain = null) {
  if (stopCrawling || depth < 0 || visited.has(url) || failed.has(url)) return;
  visited.add(url);
  // ... existing code ...

  // Introduce a delay between requests
  await new Promise(resolve => setTimeout(resolve, 1000)); // 1-second delay
  // ... continue crawling ...
}
The JavaScript-based web crawler and image downloader prototype integrated into Project nGene.org offers a user-friendly interface for automated data collection directly within the browser. However, significant challenges related to browser security policies, particularly CORS, and inherent limitations in handling dynamic content and maintaining performance are encountered. By adopting strategies such as using CORS proxies, shifting to server-side crawling, leveraging headless browsers, and implementing robust rate limiting, these limitations can be effectively mitigated. These enhancements will enable the prototype to perform more comprehensive and efficient data collection, thereby supporting the mission to advance hemodynamic research through accurate and extensive biomedical data aggregation.
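To make the link-handling steps concrete (finding a chosen tag, building absolute URLs, and optionally staying on the same domain), here is a minimal sketch that works on an HTML string and therefore could also run outside the browser, for example under Node.js. The function name extractUrls and its parameters are assumptions for illustration, not the prototype's code, and the regular expression is a simplification; a production crawler would use a real HTML parser.

```javascript
// Hypothetical helper: pulls href (for <a>) or src (for other tags) values out of an
// HTML string and resolves them against the page URL.
// Assumes `tag` is a plain tag name such as 'a' or 'img'.
function extractUrls(html, pageUrl, tag, sameDomainOnly) {
  const attr = tag === 'a' ? 'href' : 'src';
  const pattern = new RegExp(`<${tag}\\b[^>]*?\\b${attr}\\s*=\\s*["']([^"']+)["']`, 'gi');
  const base = new URL(pageUrl);
  const found = new Set(); // de-duplicates URLs

  for (const match of html.matchAll(pattern)) {
    let absolute;
    try {
      absolute = new URL(match[1], base); // resolve relative URLs
    } catch (e) {
      continue; // skip values that are not valid URLs
    }
    if (!/^https?:$/.test(absolute.protocol)) continue; // skip mailto:, javascript:, etc.
    if (sameDomainOnly && absolute.hostname !== base.hostname) continue;
    absolute.hash = ''; // fragments point to the same document
    found.add(absolute.href);
  }
  return [...found];
}

const samplePage = '<a href="/about">About</a> <a href="https://other.org/p#top">Other</a>';
console.log(extractUrls(samplePage, 'https://example.com/dir/page.html', 'a', true));
```

Keeping this step as a pure function of the HTML text means the same logic can be reused whether the page was fetched in the browser, through a proxy, or on a server.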
The ability to automatically download images and videos from webpages can enhance productivity and user experience. Implementing this functionality in Firefox can be approached in two primary ways: modifying Firefox's source code or developing a browser extension. This document provides an integrated overview of these methods, focusing on the creation of a Firefox extension due to its practicality and ease of maintenance.
Modifying the Firefox source code involves directly editing the browser's internal components to include the desired functionality. While this approach offers deep integration and control, it presents significant challenges:
Creating a Firefox extension, specifically a WebExtension, is a more practical solution. Extensions are easier to develop, maintain, and distribute. They operate within the browser's existing framework, providing the desired functionality without altering the core code.
Firefox extensions utilize standard web technologies, making development accessible:
manifest.json file.
about:debugging.
about:debugging#/runtime/this-firefox
manifest.json: Defines metadata, permissions, and scripts.

The extension operates by:
The extension can download content from websites requiring authentication because:
Possible limitations include:
Websites might detect automated downloading through:
robots.txt guidelines.

Websites like YouTube, Netflix, and other streaming services employ DRM technologies that prevent the downloading of their content. The extension:
Users should be mindful of:
Written on November 29th, 2024
Automated downloading and web scraping can inadvertently trigger detection mechanisms on websites, potentially resulting in blocks or other restrictions. Implementing best practices helps minimize the risk of detection while ensuring responsible and ethical use of automated tools within Firefox extensions. The strategies outlined below provide guidance on emulating human-like behavior, respecting website policies, and preventing server overload.
Introducing delays between download requests is essential for mimicking human behavior. Randomized delays make automated activities less distinguishable from those of regular users.
function startDownloadWithDelay(item, delay) {
  setTimeout(() => {
    startDownload(item);
  }, delay);
}

// Use a random delay between 1 and 3 seconds
const randomDelay = Math.random() * 2000 + 1000; // 1000 to 3000 ms
startDownloadWithDelay(item, randomDelay);
In this example, startDownloadWithDelay introduces a delay before initiating the download. The delay is randomized between 1 and 3 seconds to avoid regular patterns that automated detection systems might recognize.
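When several items are queued, scheduling each one with its own setTimeout would start them all within the same few seconds. The sketch below, offered only as one possible arrangement, processes items one at a time and draws a fresh random pause before each. The names randomDelayMs, sleep, and downloadSequentially are hypothetical, and startDownload is passed in as a parameter to stand for the download function used in the example above.

```javascript
// Returns a delay in milliseconds drawn uniformly from [minMs, maxMs).
function randomDelayMs(minMs, maxMs) {
  return Math.random() * (maxMs - minMs) + minMs;
}

// Promise-based pause.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Processes items strictly one after another, waiting a new random interval
// before each download is started.
async function downloadSequentially(items, startDownload, minMs = 1000, maxMs = 3000) {
  for (const item of items) {
    await sleep(randomDelayMs(minMs, maxMs));
    startDownload(item);
  }
}

// Example call (not executed here):
//   downloadSequentially(itemsToDownload, startDownload);
```

Because each pause is drawn independently, the intervals between requests vary from one download to the next rather than repeating a single value.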
Focusing on downloading only visible and relevant media reduces the volume of requests and aligns with typical user behavior.
function isElementInViewport(el) {
  const rect = el.getBoundingClientRect();
  return (
    rect.top >= 0 &&
    rect.left >= 0 &&
    rect.bottom <= (window.innerHeight || document.documentElement.clientHeight) &&
    rect.right <= (window.innerWidth || document.documentElement.clientWidth)
  );
}
The isElementInViewport function determines if a media element is entirely within the visible area of the webpage. By downloading only these elements, the automation mimics typical user interaction with the page.
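The check above accepts an element only when it lies entirely inside the viewport, so a large image that is partly scrolled off screen is skipped. If partially visible media should count as well, a looser test could look like the sketch below. The function isRectPartiallyVisible is a hypothetical variant, written against a plain rectangle so it can be tried outside a browser.

```javascript
// Hypothetical variant: true when any part of the rectangle overlaps the viewport.
// `rect` has the same top/left/bottom/right fields as getBoundingClientRect().
function isRectPartiallyVisible(rect, viewportWidth, viewportHeight) {
  return (
    rect.bottom > 0 &&
    rect.right > 0 &&
    rect.top < viewportHeight &&
    rect.left < viewportWidth
  );
}

// In a page, the inputs would come from the DOM, for example:
//   const visibleImages = Array.from(document.images).filter((img) =>
//     isRectPartiallyVisible(img.getBoundingClientRect(), window.innerWidth, window.innerHeight));

console.log(isRectPartiallyVisible({ top: -50, left: 0, bottom: 100, right: 200 }, 800, 600));
```

Which of the two tests is appropriate depends on whether a partly visible image should be treated as something the user is looking at.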
Adhering to a website's policies and guidelines is essential for ethical automation practices. The robots.txt file provides directives on how automated agents should interact with the site.
robots.txt: Access the robots.txt file to understand the allowed and disallowed paths for automated agents.
Disallow.
robots.txt
example.com with the target domain.

robots.txt Content

User-agent: *
Disallow: /private/
In this example, all user agents are instructed not to access the /private/ directory. Automated tools should respect this directive to comply with the website's policies.
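A tool can apply such a directive programmatically before requesting a path. The following is a deliberately simplified sketch, assuming only the "User-agent: *" group matters and ignoring Allow rules and wildcard patterns. The function names parseDisallowedPaths and isPathAllowed are hypothetical.

```javascript
// Hypothetical helper: collects the Disallow prefixes that apply to all agents ("*").
// Simplified: ignores Allow rules, wildcards, and agent-specific groups.
function parseDisallowedPaths(robotsTxt) {
  const disallowed = [];
  let appliesToAll = false;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      if (!lastWasAgent) appliesToAll = false;
      if (value === '*') appliesToAll = true;
      lastWasAgent = true;
    } else {
      if (field === 'disallow' && appliesToAll && value !== '') {
        disallowed.push(value);
      }
      lastWasAgent = false;
    }
  }
  return disallowed;
}

// True when the path does not start with any disallowed prefix.
function isPathAllowed(path, disallowed) {
  return !disallowed.some((prefix) => path.startsWith(prefix));
}

const rules = parseDisallowedPaths('User-agent: *\nDisallow: /private/');
console.log(isPathAllowed('/private/report.html', rules)); // false
console.log(isPathAllowed('/public/index.html', rules));   // true
```

For anything beyond a quick check, an established robots.txt parser that implements the full matching rules would be the safer choice.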
Maintaining standard request headers helps prevent anomalies that might trigger detection systems. Custom headers or unusual values can raise red flags.
Avoid modifying headers such as Referer or User-Agent unless necessary for functionality.

By adhering to standard header configurations, automated requests appear more like those from regular users, reducing the likelihood of detection.
Excessive simultaneous downloads can strain server resources and negatively impact website performance. Limiting concurrency ensures responsible use of resources.
let activeDownloads = 0;
const maxConcurrentDownloads = 5;
const downloadQueue = [];

function processQueue() {
  // Fill every free slot, not just one, so that up to five downloads run in parallel
  while (activeDownloads < maxConcurrentDownloads && downloadQueue.length > 0) {
    const item = downloadQueue.shift();
    activeDownloads++;
    startDownload(item, () => {
      activeDownloads--;
      processQueue();
    });
  }
}

// Add items to the queue and start processing
downloadQueue.push(...itemsToDownload);
processQueue();
In this code, processQueue manages the download queue by ensuring that no more than five downloads occur at the same time. The startDownload function includes the logic for downloading the item and invokes a callback upon completion.
Written on November 30th, 2024
This document provides a comprehensive explanation of the five scripts used in the nGeneAutomaticDownloader Firefox extension. Each section includes the full script with detailed comments and an explanation of how functions and features are implemented to facilitate easier understanding and maintenance.
manifest.json
{
"manifest_version": 2, // Specifies the version of the manifest file format
"name": "nGeneAutomaticDownloader", // The name of the extension
"version": "1.5", // The version of the extension
"description": "Automatically downloads all images and videos from webpages to a 'firefox' folder in your default download directory.", // A brief description
"permissions": [
"downloads", // Allows use of the downloads API to download files
"tabs", // Grants access to browser tabs
"<all_urls>", // Allows access to all URLs
"storage", // Permits storage and retrieval of data using chrome.storage API
"webRequest", // Enables observation and analysis of web requests
"webRequestBlocking" // Allows modification or blocking of web requests
],
"background": {
"scripts": ["background.js"] // Specifies the background script
},
"content_scripts": [
{
"matches": ["<all_urls>"], // The content script will be injected into all pages
"exclude_matches": ["about:*", "resource://*/*"], // Excludes internal browser pages
"js": ["content.js"], // The content script file
"run_at": "document_idle" // Injects the script after the page has loaded
}
],
"browser_action": {
"default_title": "nGeneAutomaticDownloader", // Tooltip text for the browser action icon
"default_popup": "options.html", // HTML file displayed when the icon is clicked
"default_icon": {
"48": "icons/download-icon.png" // Icon for the browser action
}
},
"options_ui": {
"page": "options.html", // Options page for the extension
"open_in_tab": false // Embeds the options page in the Add-ons Manager instead of opening it in a new tab
},
"icons": {
"48": "icons/download-icon.png" // The extension's icon
}
}
The manifest.json file is the configuration file for the Firefox extension. It defines essential metadata and specifies the extension's behavior.

permissions:
downloads: Allows the extension to download files using the downloads API.
tabs: Grants access to browser tabs.
<all_urls>: Permits the extension to access all URLs.
storage: Enables data storage and retrieval using the chrome.storage API.
webRequest: Allows the extension to observe and analyze web requests.
webRequestBlocking: Permits the extension to modify or block web requests.

background:
scripts: Specifies the script (background.js) that runs in the background context.

content_scripts:
matches: Defines URLs where the content script (content.js) will be injected. Here, it matches all URLs.
exclude_matches: Excludes specific internal browser pages from script injection.
run_at: Sets the injection timing to after the page has loaded.

browser_action:
default_title: Tooltip text for the browser action icon.
default_popup: The HTML file (options.html) displayed when the icon is clicked.
default_icon: The icon for the browser action.

options_ui:
page: The options page (options.html) for the extension.
open_in_tab: Determines whether the options page opens in a new tab (true) or is embedded in the Add-ons Manager (false).

content.js
(function () {
// Set to keep track of processed media URLs to prevent duplicates
const processedMediaUrls = new Set();
// Main function to process media elements starting from a root node
function processMediaElements(rootNode) {
const mediaUrls = []; // Array to collect media URLs found
// If the root node is not an element or the document itself, exit
if (rootNode.nodeType !== Node.ELEMENT_NODE && rootNode !== document) {
return;
}
// Nodes to process; start with the root node
const nodes = rootNode === document ? [document] : [rootNode];
// Iterate over each node to collect media URLs
nodes.forEach((node) => {
// Collect images from <img> tags
node.querySelectorAll('img').forEach((img) => {
collectImageFromElement(img, mediaUrls);
});
// Collect images from <picture> elements
node.querySelectorAll('picture source').forEach((source) => {
collectSrcsetUrls(source, mediaUrls);
});
// Collect videos and their source elements
node.querySelectorAll('video, source').forEach((element) => {
collectVideoFromElement(element, mediaUrls);
});
// Collect images from <object> and <embed> tags
node.querySelectorAll('object, embed').forEach((element) => {
collectObjectEmbedMedia(element, mediaUrls);
});
// Collect background images from CSS stylesheets
collectBackgroundImages(mediaUrls);
// Collect images from inline styles
collectInlineStyles(mediaUrls);
// Collect images from <canvas> elements
node.querySelectorAll('canvas').forEach((canvas) => {
collectCanvasImage(canvas);
});
// Collect images from pseudo-elements (::before and ::after)
collectPseudoElementImages(mediaUrls);
});
// Process the collected media URLs
processMediaUrls(mediaUrls);
}
// Collects image URLs from <img> elements
function collectImageFromElement(img, mediaUrls) {
let url = img.src || img.currentSrc; // Get the image source URL
if (!url) {
// Check for lazy-loaded images
url = img.getAttribute('data-src') || img.getAttribute('data-lazy-src');
}
if (url) {
mediaUrls.push(url); // Add the URL to the list
}
// Handle srcset attribute for responsive images
const srcset = img.getAttribute('srcset');
if (srcset) {
const srcsetUrls = srcset
.split(',')
.map((entry) => entry.trim().split(' ')[0]);
srcsetUrls.forEach((srcsetUrl) => {
if (srcsetUrl) {
mediaUrls.push(srcsetUrl);
}
});
}
}
// Collects image URLs from <source> elements in <picture> tags
function collectSrcsetUrls(element, mediaUrls) {
const srcset = element.getAttribute('srcset');
if (srcset) {
const srcsetUrls = srcset
.split(',')
.map((entry) => entry.trim().split(' ')[0]);
srcsetUrls.forEach((srcsetUrl) => {
if (srcsetUrl) {
mediaUrls.push(srcsetUrl);
}
});
}
// Check for 'src' attribute
const src = element.getAttribute('src');
if (src) {
mediaUrls.push(src);
}
}
// Extracts images from <canvas> elements
function collectCanvasImage(canvas) {
try {
// Convert the canvas content to a data URL
const dataURL = canvas.toDataURL();
if (dataURL && !processedMediaUrls.has(dataURL)) {
processedMediaUrls.add(dataURL); // Mark as processed
// Send a message to download the data URL
chrome.runtime.sendMessage(
{ type: 'downloadDataUrl', dataUrl: dataURL },
function (response) {
if (chrome.runtime.lastError) {
console.error(
`Error sending canvas image: ${chrome.runtime.lastError.message}`
);
}
}
);
}
} catch (e) {
console.error('Failed to extract image from canvas:', e);
}
}
// Collects video URLs from <video> and <source> elements
function collectVideoFromElement(element, mediaUrls) {
if (element.tagName.toLowerCase() === 'video') {
let url = element.src || element.currentSrc; // Get the video source URL
if (!url) {
url = element.getAttribute('data-src');
}
if (url) {
mediaUrls.push(url);
}
// Process <source> elements within the <video>
element.querySelectorAll('source').forEach((sourceElement) => {
const sourceUrl =
sourceElement.src ||
sourceElement.getAttribute('src') ||
sourceElement.getAttribute('data-src');
if (sourceUrl) {
mediaUrls.push(sourceUrl);
}
});
// Check for 'poster' attribute
const posterUrl = element.getAttribute('poster');
if (posterUrl) {
mediaUrls.push(posterUrl);
}
} else if (element.tagName.toLowerCase() === 'source') {
// For <source> elements outside of <video>
const sourceUrl =
element.src ||
element.getAttribute('src') ||
element.getAttribute('data-src');
if (sourceUrl) {
mediaUrls.push(sourceUrl);
}
}
}
// Collects media URLs from <object> and <embed> elements
function collectObjectEmbedMedia(element, mediaUrls) {
const url = element.data || element.getAttribute('data');
if (url) {
mediaUrls.push(url);
}
}
// Collects background images from CSS stylesheets
function collectBackgroundImages(mediaUrls) {
for (const sheet of document.styleSheets) {
let rules;
try {
rules = sheet.cssRules; // Get CSS rules
} catch (e) {
// Skip cross-origin stylesheets
continue;
}
if (!rules) continue;
for (const rule of rules) {
if (rule.type === CSSRule.STYLE_RULE) {
const style = rule.style;
const bgImage =
style.getPropertyValue('background-image') ||
style.getPropertyValue('background');
extractUrlsFromStyle(bgImage, mediaUrls);
} else if (rule.type === CSSRule.MEDIA_RULE) {
// Handle @media rules
for (const mediaRule of rule.cssRules) {
if (mediaRule.type === CSSRule.STYLE_RULE) {
const style = mediaRule.style;
const bgImage =
style.getPropertyValue('background-image') ||
style.getPropertyValue('background');
extractUrlsFromStyle(bgImage, mediaUrls);
}
}
}
}
}
}
// Collects background images from inline styles
function collectInlineStyles(mediaUrls) {
document.querySelectorAll('*[style]').forEach((element) => {
const style = element.getAttribute('style');
extractUrlsFromStyle(style, mediaUrls);
});
}
// Collects images from pseudo-elements (::before and ::after)
function collectPseudoElementImages(mediaUrls) {
document.querySelectorAll('*').forEach((element) => {
['::before', '::after'].forEach((pseudo) => {
const style = getComputedStyle(element, pseudo);
const bgImage =
style.getPropertyValue('background-image') ||
style.getPropertyValue('background');
extractUrlsFromStyle(bgImage, mediaUrls);
});
});
}
// Extracts URLs from CSS style properties
function extractUrlsFromStyle(styleValue, mediaUrls) {
if (styleValue && styleValue !== 'none') {
// Match URLs in the style value
const urls = styleValue.match(/url\(["']?([^"')]+)["']?\)/g);
if (urls) {
urls.forEach((urlString) => {
const url = urlString.match(/url\(["']?([^"')]+)["']?\)/)[1];
if (url) {
// Resolve relative URLs to absolute URLs
const absoluteUrl = new URL(url, location.href).href;
mediaUrls.push(absoluteUrl);
}
});
}
}
}
// Processes the collected media URLs
function processMediaUrls(mediaUrls) {
// Remove duplicates and already processed URLs
const uniqueUrls = Array.from(new Set(mediaUrls));
uniqueUrls.forEach((url) => {
const cleanUrl = url.split('#')[0]; // Remove fragment identifiers
if (processedMediaUrls.has(cleanUrl)) {
return; // Skip already processed URLs
}
processedMediaUrls.add(cleanUrl); // Mark as processed
// Handle data URLs
if (url.startsWith('data:')) {
// Send a message to download the data URL
chrome.runtime.sendMessage(
{ type: 'downloadDataUrl', dataUrl: url },
function (response) {
if (chrome.runtime.lastError) {
console.error(
`Error sending data URL: ${chrome.runtime.lastError.message}`
);
}
}
);
return;
}
let filename;
try {
const urlObj = new URL(url, location.href); // Create a URL object
filename = urlObj.pathname.split('/').pop(); // Extract the filename
if (!filename || filename.length === 0) {
filename = 'unnamed'; // Default filename
}
// Append an extension only when the filename does not already have one
if (!filename.includes('.')) {
// Guess the extension from a 'type' query parameter holding a MIME type, if present
const mimeType = urlObj.searchParams.get('type') || '';
if (mimeType) {
filename += '.' + mimeType.split('/').pop();
}
}
} catch (e) {
console.error(`Invalid URL: ${url}`);
return;
}
// Send a message to download the file
chrome.runtime.sendMessage(
{ type: 'download', url: url, filename: filename },
function (response) {
if (chrome.runtime.lastError) {
console.error(
`Error sending message for ${filename}: ${chrome.runtime.lastError.message}`
);
}
}
);
});
}
// Observes changes in the DOM to detect new media elements
const observer = new MutationObserver((mutations) => {
mutations.forEach((mutation) => {
if (mutation.type === 'childList') {
// If nodes are added to the DOM
mutation.addedNodes.forEach((node) => {
if (node.nodeType === Node.ELEMENT_NODE) {
processMediaElements(node); // Process the new node
// Also process media elements within this node
node
.querySelectorAll(
'img, video, source, picture source, object, embed, canvas'
)
.forEach((element) => {
processMediaElements(element);
});
}
});
} else if (mutation.type === 'attributes') {
// If attributes of an element have changed
if (mutation.target && mutation.target.nodeType === Node.ELEMENT_NODE) {
// Check if the changed attribute is relevant
const relevantAttributes = [
'src',
'srcset',
'style',
'data-src',
'data-lazy-src',
'poster',
'data',
'href',
];
if (relevantAttributes.includes(mutation.attributeName)) {
processMediaElements(mutation.target); // Process the element
}
}
}
});
});
// Start observing the document for changes
observer.observe(document, {
childList: true, // Observe when nodes are added or removed
subtree: true, // Observe all descendant nodes
attributes: true, // Observe attribute changes
attributeFilter: [
'src',
'srcset',
'style',
'data-src',
'data-lazy-src',
'poster',
'data',
'href',
], // Attributes to observe
});
// Listen for user interactions to trigger processing
['click', 'scroll', 'mousemove', 'touchstart', 'touchmove'].forEach(
(event) => {
document.addEventListener(event, () => {
processMediaElements(document);
});
}
);
// Initial processing when the window loads
window.addEventListener('load', () => {
processMediaElements(document);
});
})();
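The srcset handling in collectImageFromElement and collectSrcsetUrls above reads the raw attribute value, so the collected candidates may still be relative URLs. As a small sketch of how that step could be isolated and made to return absolute URLs, consider the helper below. The name parseSrcset is hypothetical, and the comma split assumes that candidate URLs contain no commas themselves, which is not guaranteed by the HTML specification.

```javascript
// Hypothetical helper: parses a srcset value into a list of absolute URLs.
// Simplified: assumes no candidate URL contains a comma.
function parseSrcset(srcset, baseUrl) {
  return srcset
    .split(',')                                     // one entry per image candidate
    .map((entry) => entry.trim().split(/\s+/)[0])   // drop the width/density descriptor
    .filter((candidate) => candidate.length > 0)    // ignore empty entries
    .map((candidate) => new URL(candidate, baseUrl).href); // resolve relative URLs
}

console.log(
  parseSrcset('img/a-480.jpg 480w, /b-800.jpg 800w', 'https://example.com/gallery/page.html')
);
```

In the content script, location.href would be the natural value for baseUrl, matching how extractUrlsFromStyle already resolves URLs found in CSS.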
The content.js script is a content script that runs in the context of web pages. Its primary purpose is to identify and collect all media elements (images, videos, etc.) on a web page and send messages to the background script to download these media files.

processedMediaUrls: A Set used to keep track of media URLs that have already been processed, preventing duplicate downloads.
processMediaElements(rootNode): The main function that processes media elements starting from a given root node. It collects media URLs from various elements such as <img>, <picture>, <video>, <source>, <object>, <embed>, <canvas>, and also from CSS styles and pseudo-elements.
collectImageFromElement(img, mediaUrls): Collects image URLs from <img> elements, handling src, currentSrc, data-src, data-lazy-src, and srcset attributes.
collectSrcsetUrls(element, mediaUrls): Collects image URLs from <source> elements within <picture> tags, processing srcset and src attributes.
collectVideoFromElement(element, mediaUrls): Collects video URLs from <video> and <source> elements, handling src, currentSrc, data-src, and poster attributes.
collectObjectEmbedMedia(element, mediaUrls): Collects media URLs from <object> and <embed> elements.
collectCanvasImage(canvas): Extracts images from <canvas> elements by converting the canvas content to a data URL.
collectBackgroundImages(mediaUrls): Collects background images specified in CSS stylesheets.
collectInlineStyles(mediaUrls): Collects background images from inline style attributes.
collectPseudoElementImages(mediaUrls): Collects images from CSS pseudo-elements ::before and ::after.
extractUrlsFromStyle(styleValue, mediaUrls): A helper function that extracts URLs from CSS style properties using regular expressions.
processMediaUrls(mediaUrls): Processes the collected media URLs by removing duplicates and already processed URLs. It handles data URLs and regular URLs differently: data URLs are sent to the background script as downloadDataUrl messages, while regular URLs are sent as download messages together with a filename derived from the URL path.
observer: A MutationObserver that watches for changes in the DOM, such as the addition of new nodes or changes to attributes, to detect dynamically loaded media.
Event listeners: Listen for click, scroll, mousemove, touchstart, and touchmove to trigger reprocessing of the page, capturing media that loads upon user interaction.

background.js
// Variables to manage downloads and settings
let downloadQueue = []; // Queue for download requests
let activeDownloads = 0; // Number of active downloads
let maxConcurrentDownloads = 10; // Default maximum concurrent downloads
let keepTrack = true; // Whether to keep track of downloaded filenames
let minFileSize = 50 * 1024; // Minimum file size in bytes (default 50 KB)
let downloadedFilenames = new Set(); // Set to store filenames of downloaded files
let downloadedUrls = new Set(); // Set to store URLs of downloaded files
// Load initial settings from chrome.storage.local
chrome.storage.local.get(
[
'threads',
'keepTrack',
'minFileSize',
'downloadedFilenames',
'downloadedUrls',
],
(result) => {
// Update variables with saved settings or use defaults
maxConcurrentDownloads = result.threads || 10;
keepTrack = result.keepTrack !== false; // Default to true if undefined
minFileSize = (result.minFileSize || 50) * 1024; // Convert KB to bytes
if (result.downloadedFilenames && Array.isArray(result.downloadedFilenames)) {
downloadedFilenames = new Set(result.downloadedFilenames); // Initialize set with saved filenames
}
if (result.downloadedUrls && Array.isArray(result.downloadedUrls)) {
downloadedUrls = new Set(result.downloadedUrls); // Initialize set with saved URLs
}
}
);
// Listener for messages from other parts of the extension
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
if (message.type === 'updateSettings') {
// Update settings when they are changed in options
chrome.storage.local.get(
['threads', 'keepTrack', 'minFileSize'],
(result) => {
maxConcurrentDownloads = result.threads || 10;
keepTrack = result.keepTrack !== false;
minFileSize = (result.minFileSize || 50) * 1024;
console.log(
`Updated settings: maxConcurrentDownloads = ${maxConcurrentDownloads}, keepTrack = ${keepTrack}, minFileSize = ${minFileSize} bytes`
);
}
);
} else if (message.type === 'resetQueue') {
// Reset the download queue
downloadQueue = [];
sendResponse({ status: 'success' });
} else if (message.type === 'clearHistory') {
// Clear the set of downloaded filenames and URLs
downloadedFilenames.clear();
downloadedUrls.clear();
chrome.storage.local.set(
{ downloadedFilenames: [], downloadedUrls: [] },
() => {
sendResponse({ status: 'success' });
}
);
return true; // Keep the message channel open for sendResponse
} else if (message.type === 'download') {
// Handle download request from content script
const url = message.url;
const filename = message.filename;
// Check if the file has already been downloaded
if (keepTrack && downloadedUrls.has(url)) {
console.log(
`Skipping download for URL ${url} as it has already been downloaded.`
);
sendResponse({ status: 'skipped' });
return;
}
// Add the download request to the queue
downloadQueue.push({ url: url, filename: filename });
// Start processing the queue
processQueue();
sendResponse({ status: 'queued' });
} else if (message.type === 'downloadDataUrl') {
// Handle download request for data URL
const dataUrl = message.dataUrl;
const filename = `image_${Date.now()}.png`; // Generate a unique filename
// Convert data URL to Blob
fetch(dataUrl)
.then((res) => res.blob())
.then((blob) => {
const url = URL.createObjectURL(blob);
downloadQueue.push({ url: url, filename: filename, revokeUrl: true });
processQueue();
sendResponse({ status: 'queued' });
})
.catch((error) => {
console.error(`Failed to download data URL: ${error}`);
sendResponse({ status: 'error', error: error.toString() });
});
return true; // Keep the message channel open for sendResponse
}
});
// Function to process the download queue
function processQueue() {
// Continue processing while there are slots for active downloads
while (activeDownloads < maxConcurrentDownloads && downloadQueue.length > 0) {
const item = downloadQueue.shift(); // Get the next item from the queue
startDownload(item); // Start the download
}
}
// Function to start a download
function startDownload(item) {
activeDownloads++; // Increment the number of active downloads
const filename = item.filename || item.url.split('/').pop();
const fullFilename = `firefox/${sanitizeFilename(filename)}`; // Prepend 'firefox/' to create a subdirectory
chrome.downloads.download(
{
url: item.url,
filename: fullFilename,
saveAs: false, // Do not prompt the user for save location
conflictAction: 'overwrite', // Overwrite existing files
},
(downloadId) => {
if (chrome.runtime.lastError) {
// Handle errors during download initiation
console.error(
`Download failed for ${filename}: ${chrome.runtime.lastError.message}`
);
activeDownloads--; // Decrement active downloads
processQueue(); // Try the next item in the queue
} else {
console.log(
`Download started: ID = ${downloadId}, filename = ${filename}`
);
// Listener for changes in the download state
function onChanged(delta) {
if (
delta.id === downloadId &&
delta.state &&
delta.state.current === 'complete'
) {
// Download completed successfully
chrome.downloads.search({ id: downloadId }, function (items) {
if (items && items.length > 0) {
const downloadItem = items[0];
const fileSize =
downloadItem.fileSize || downloadItem.totalBytes; // Get the file size
if (fileSize < minFileSize) {
// File is smaller than the minimum size
chrome.downloads.removeFile(downloadId, function () {
if (chrome.runtime.lastError) {
console.error(
`Failed to remove file: ${chrome.runtime.lastError.message}`
);
} else {
console.log(
`Removed file ${filename} (size ${fileSize} bytes) because it is smaller than the minimum size (${minFileSize} bytes).`
);
}
// Remove the download from history
chrome.downloads.erase({ id: downloadId });
});
} else {
// File meets the size requirement
if (keepTrack) {
// Add the filename and URL to the sets
downloadedFilenames.add(filename);
downloadedUrls.add(item.url);
// Update the stored sets
chrome.storage.local.set({
downloadedFilenames: Array.from(downloadedFilenames),
downloadedUrls: Array.from(downloadedUrls),
});
}
}
}
});
// Cleanup after download completion
chrome.downloads.onChanged.removeListener(onChanged);
activeDownloads--; // Decrement active downloads
processQueue(); // Process the next item
// Revoke object URL if necessary
if (item.revokeUrl) {
URL.revokeObjectURL(item.url);
}
} else if (
delta.id === downloadId &&
delta.state &&
delta.state.current === 'interrupted'
) {
// Download was interrupted
console.error(`Download interrupted for ${filename}`);
chrome.downloads.onChanged.removeListener(onChanged);
activeDownloads--; // Decrement active downloads
processQueue(); // Process the next item
// Revoke object URL if necessary
if (item.revokeUrl) {
URL.revokeObjectURL(item.url);
}
}
}
// Add the listener to monitor download changes
chrome.downloads.onChanged.addListener(onChanged);
}
}
);
}
// Function to sanitize filenames
function sanitizeFilename(filename) {
return filename.replace(/[\\/:*?"<>|]/g, '_');
}
// Use the webRequest API to monitor network requests
chrome.webRequest.onCompleted.addListener(
(details) => {
// Check if the request is for an image or video
if (details.type === 'image' || details.type === 'media') {
const url = details.url;
// Extract filename from URL
let filename;
try {
const urlObj = new URL(url);
filename = urlObj.pathname.split('/').pop();
if (!filename || filename.length === 0) {
filename = 'unnamed';
}
} catch (e) {
console.error(`Invalid URL: ${url}`);
return;
}
filename = filename.split('?')[0];
// Sanitize filename
filename = sanitizeFilename(filename);
// Check if the file has already been downloaded
if (keepTrack && downloadedUrls.has(url)) {
console.log(
`Skipping download for URL ${url} as it has already been downloaded.`
);
return;
}
// Add the download request to the queue
downloadQueue.push({ url: url, filename: filename });
// Start processing the queue
processQueue();
}
},
{ urls: ['<all_urls>'] },
[]
);
The background.js script runs in the background context of the extension. It manages the downloading of media files, handles settings, and communicates with the content script and options page.
downloadQueue: An array serving as a queue for download requests.
activeDownloads: Tracks the number of downloads currently in progress.
maxConcurrentDownloads: Maximum number of concurrent downloads allowed.
keepTrack: Indicates whether to track downloaded filenames and URLs to avoid duplicates.
minFileSize: The minimum file size (in bytes) required for a file to be kept.
downloadedFilenames and downloadedUrls: Sets to store filenames and URLs of downloaded files.
'updateSettings': Reloads settings when changed.
'resetQueue': Resets the download queue.
'clearHistory': Clears the history of downloaded filenames and URLs.
'download': Handles download requests from the content script.
'downloadDataUrl': Handles download requests for data URLs.
processQueue(): Processes the download queue while there are available slots for active downloads.
startDownload(item): Initiates the download of an item and sets up listeners to monitor the download state.
sanitizeFilename(filename): Replaces invalid characters in filenames.
The script also uses the webRequest API to monitor completed web requests and adds relevant media URLs to the download queue.
options.js
// Wait until the DOM content is fully loaded
document.addEventListener('DOMContentLoaded', () => {
// References to the HTML elements in options.html
const threadsInput = document.getElementById('threads'); // Input for concurrent downloads
const keepTrackCheckbox = document.getElementById('keepTrack'); // Checkbox for tracking filenames
const minFileSizeInput = document.getElementById('minFileSize'); // Input for minimum file size
const saveButton = document.getElementById('save'); // Button to save settings
const resetButton = document.getElementById('reset'); // Button to reset the download queue
const clearHistoryButton = document.getElementById('clearHistory'); // Button to clear history
// Load saved settings from chrome.storage.local
chrome.storage.local.get(['threads', 'keepTrack', 'minFileSize'], (result) => {
// Set input values to saved settings or defaults
threadsInput.value = result.threads || 10; // Default to 10 threads
keepTrackCheckbox.checked = result.keepTrack !== false; // Default to true
minFileSizeInput.value = result.minFileSize || 50; // Default to 50 KB
});
// Event listener for the Save Settings button
saveButton.addEventListener('click', () => {
// Retrieve values from the inputs
const threads = parseInt(threadsInput.value, 10) || 10;
const keepTrack = keepTrackCheckbox.checked;
const minFileSize = parseInt(minFileSizeInput.value, 10) || 50;
// Save the settings to chrome.storage.local
chrome.storage.local.set(
{ threads: threads, keepTrack: keepTrack, minFileSize: minFileSize },
() => {
alert('Settings saved.');
// Notify the background script to update settings
chrome.runtime.sendMessage({ type: 'updateSettings' });
}
);
});
// Event listener for the Reset Download Queue button
resetButton.addEventListener('click', () => {
// Send a message to reset the queue
chrome.runtime.sendMessage({ type: 'resetQueue' }, (response) => {
if (response && response.status === 'success') {
alert('Download queue reset.');
} else {
alert('Failed to reset queue.');
}
});
});
// Event listener for the Clear Downloaded Filenames button
clearHistoryButton.addEventListener('click', () => {
// Send a message to clear the history
chrome.runtime.sendMessage({ type: 'clearHistory' }, (response) => {
if (response && response.status === 'success') {
alert('Downloaded filenames history cleared.');
} else {
alert('Failed to clear history.');
}
});
});
});
The options.js script handles the user interface of the extension's options page, allowing users to adjust settings and perform actions.
threadsInput: Input field for the number of concurrent downloads.
keepTrackCheckbox: Checkbox for enabling/disabling tracking of downloaded filenames.
minFileSizeInput: Input field for the minimum file size.
saveButton: Button to save settings.
resetButton: Button to reset the download queue.
clearHistoryButton: Button to clear the download history.
On load, the script reads saved settings from chrome.storage.local and updates the input fields.
options.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>nGeneAutomaticDownloader Options</title>
<style>
/* Basic styling for the options page */
body { font-family: Arial, sans-serif; padding: 10px; }
label { display: block; margin-bottom: 5px; }
input[type="number"] { width: 50px; }
button { margin-top: 10px; margin-right: 10px; }
</style>
</head>
<body>
<h2>Extension Settings</h2>
<!-- Setting for the number of concurrent downloads -->
<label>
Number of Concurrent Downloads:
<input type="number" id="threads" min="1" max="10">
</label>
<!-- Setting to keep track of downloaded filenames -->
<label>
Keep track of downloaded filenames:
<input type="checkbox" id="keepTrack">
</label>
<!-- Setting for the minimum file size threshold -->
<label>
Minimum File Size (KB):
<input type="number" id="minFileSize" min="0" step="1">
</label>
<!-- Buttons to save settings, reset download queue, and clear history -->
<button id="save">Save Settings</button>
<button id="reset">Reset Download Queue</button>
<button id="clearHistory">Clear Downloaded Filenames</button>
<!-- Include the options.js script -->
<script src="options.js"></script>
</body>
</html>
The options.html file defines the user interface for the extension's options page.
meta charset: Specifies the character encoding.
title: Sets the page title.
style: Contains basic CSS styling for the page.
h2: "Extension Settings" heading.
Inputs: a number input with id="threads", a checkbox with id="keepTrack", and a number input with id="minFileSize".
Buttons: id="save", id="reset", and id="clearHistory".
The page loads the options.js script for interactivity.
Written on November 30th, 2024
The nGeneAutomaticDownloader is a Firefox extension designed to automatically download all images and videos from webpages, organizing them into a designated 'firefox' folder within the user's default download directory. This document provides a comprehensive overview of the extension's implementation across its constituent files. It elucidates the functionalities and features embedded within each script, offering insights into the mechanisms employed for thread management, download tracking, file type handling, and user interface configuration. This overview serves as a reference to facilitate future reviews and modifications of the extension's codebase.
{
"manifest_version": 2,
"name": "nGeneAutomaticDownloader",
"version": "1.6",
"description": "Automatically downloads all images and videos from webpages to a 'firefox' folder in your default download directory.",
"permissions": [
"downloads",
"tabs",
"<all_urls>",
"storage",
"webRequest",
"webRequestBlocking"
],
"background": {
"scripts": ["background.js"]
},
"content_scripts": [
{
"matches": ["<all_urls>"],
"exclude_matches": ["about:*", "resource://*/*"],
"js": ["content.js"],
"run_at": "document_idle"
}
],
"browser_action": {
"default_title": "nGeneAutomaticDownloader",
"default_popup": "options.html",
"default_icon": {
"48": "icons/download-icon.png"
}
},
"options_ui": {
"page": "options.html",
"open_in_tab": false
},
"icons": {
"48": "icons/download-icon.png"
}
}
(function () {
// Set to keep track of processed media URLs to prevent duplicates
const processedMediaUrls = new Set();
// Main function to process media elements starting from a root node
function processMediaElements(rootNode) {
const mediaUrls = [];
if (!rootNode) return;
// Use a Set to avoid processing the same node multiple times
const nodesToProcess = new Set();
// Collect nodes to process
function collectNodes(node) {
if (node.nodeType !== Node.ELEMENT_NODE) return;
nodesToProcess.add(node);
// Recursively collect child nodes
node.querySelectorAll('*').forEach((child) => {
nodesToProcess.add(child);
});
}
collectNodes(rootNode);
// Process each node
nodesToProcess.forEach((node) => {
// Process media elements based on their tag names
const tagName = node.tagName.toLowerCase();
if (tagName === 'img') {
collectImageFromElement(node, mediaUrls);
} else if (tagName === 'video' || tagName === 'audio') {
collectMediaFromElement(node, mediaUrls);
} else if (tagName === 'source') {
collectSourceFromElement(node, mediaUrls);
} else if (tagName === 'picture') {
collectPictureSources(node, mediaUrls);
} else if (tagName === 'object' || tagName === 'embed') {
collectObjectEmbedMedia(node, mediaUrls);
} else if (tagName === 'canvas') {
collectCanvasImage(node);
}
});
// Collect background images from styles
collectBackgroundImages(mediaUrls);
collectInlineStyles(mediaUrls);
collectPseudoElementImages(mediaUrls);
// Process the collected media URLs
processMediaUrls(mediaUrls);
}
// Collect image URLs from <img> elements
function collectImageFromElement(img, mediaUrls) {
const urls = [];
// src attribute
if (img.src) {
urls.push(img.src);
}
// data-src or data-lazy-src attributes for lazy-loaded images
const dataSrc = img.getAttribute('data-src') || img.getAttribute('data-lazy-src');
if (dataSrc) {
urls.push(dataSrc);
}
// srcset attribute
const srcset = img.getAttribute('srcset');
if (srcset) {
const srcsetUrls = srcset
.split(',')
.map((entry) => entry.trim().split(' ')[0])
.filter((url) => url);
urls.push(...srcsetUrls);
}
// Add collected URLs to mediaUrls
mediaUrls.push(...urls);
}
// Collect media URLs from <video> and <audio> elements
function collectMediaFromElement(mediaElement, mediaUrls) {
const urls = [];
// src attribute
if (mediaElement.src) {
urls.push(mediaElement.src);
}
// data-src attribute
const dataSrc = mediaElement.getAttribute('data-src');
if (dataSrc) {
urls.push(dataSrc);
}
// Poster attribute (for videos)
const poster = mediaElement.getAttribute('poster');
if (poster) {
urls.push(poster);
}
// Collect from child <source> elements
mediaElement.querySelectorAll('source').forEach((source) => {
collectSourceFromElement(source, urls);
});
// Add collected URLs to mediaUrls
mediaUrls.push(...urls);
}
// Collect URLs from <source> elements
function collectSourceFromElement(sourceElement, mediaUrls) {
const urls = [];
// src attribute
const src = sourceElement.src || sourceElement.getAttribute('src');
if (src) {
urls.push(src);
}
// data-src attribute
const dataSrc = sourceElement.getAttribute('data-src');
if (dataSrc) {
urls.push(dataSrc);
}
// srcset attribute
const srcset = sourceElement.getAttribute('srcset');
if (srcset) {
const srcsetUrls = srcset
.split(',')
.map((entry) => entry.trim().split(' ')[0])
.filter((url) => url);
urls.push(...srcsetUrls);
}
// Add collected URLs to mediaUrls
mediaUrls.push(...urls);
}
// Collect sources from <picture> elements
function collectPictureSources(pictureElement, mediaUrls) {
pictureElement.querySelectorAll('source').forEach((source) => {
collectSourceFromElement(source, mediaUrls);
});
}
// Collect media from <object> and <embed> elements
function collectObjectEmbedMedia(element, mediaUrls) {
const data = element.getAttribute('data');
if (data) {
mediaUrls.push(data);
}
const src = element.getAttribute('src');
if (src) {
mediaUrls.push(src);
}
}
// Collect images from <canvas> elements
function collectCanvasImage(canvas) {
try {
const dataURL = canvas.toDataURL();
if (dataURL && !processedMediaUrls.has(dataURL)) {
processedMediaUrls.add(dataURL);
chrome.runtime.sendMessage(
{ type: 'downloadDataUrl', dataUrl: dataURL },
function (response) {
if (chrome.runtime.lastError) {
console.error(`Error sending canvas image: ${chrome.runtime.lastError.message}`);
}
}
);
}
} catch (e) {
console.error('Failed to extract image from canvas:', e);
}
}
// Collect background images from CSS stylesheets
function collectBackgroundImages(mediaUrls) {
for (const sheet of document.styleSheets) {
let rules;
try {
rules = sheet.cssRules;
} catch (e) {
continue; // Skip cross-origin stylesheets
}
if (!rules) continue;
for (const rule of rules) {
if (rule.type === CSSRule.STYLE_RULE) {
const style = rule.style;
const bgImage = style.getPropertyValue('background-image') || style.getPropertyValue('background');
extractUrlsFromStyle(bgImage, mediaUrls);
} else if (rule.type === CSSRule.MEDIA_RULE) {
for (const mediaRule of rule.cssRules) {
if (mediaRule.type === CSSRule.STYLE_RULE) {
const style = mediaRule.style;
const bgImage = style.getPropertyValue('background-image') || style.getPropertyValue('background');
extractUrlsFromStyle(bgImage, mediaUrls);
}
}
}
}
}
}
// Collect background images from inline styles
function collectInlineStyles(mediaUrls) {
document.querySelectorAll('*[style]').forEach((element) => {
const style = element.getAttribute('style');
extractUrlsFromStyle(style, mediaUrls);
});
}
// Collect images from pseudo-elements
function collectPseudoElementImages(mediaUrls) {
document.querySelectorAll('*').forEach((element) => {
['::before', '::after'].forEach((pseudo) => {
const style = getComputedStyle(element, pseudo);
const bgImage = style.getPropertyValue('background-image') || style.getPropertyValue('background');
extractUrlsFromStyle(bgImage, mediaUrls);
});
});
}
// Extract URLs from CSS style properties
function extractUrlsFromStyle(styleValue, mediaUrls) {
if (styleValue && styleValue !== 'none') {
// Match url(...) references; the parentheses are escaped once so they are treated literally
const urls = styleValue.match(/url\(["']?([^"')]+)["']?\)/g);
if (urls) {
urls.forEach((urlString) => {
const url = urlString.match(/url\(["']?([^"')]+)["']?\)/)[1];
if (url) {
const absoluteUrl = new URL(url, location.href).href;
mediaUrls.push(absoluteUrl);
}
});
}
}
}
// Process collected media URLs
function processMediaUrls(mediaUrls) {
const uniqueUrls = Array.from(new Set(mediaUrls));
uniqueUrls.forEach((url) => {
const cleanUrl = url.split('#')[0];
if (processedMediaUrls.has(cleanUrl)) {
return;
}
processedMediaUrls.add(cleanUrl);
// Handle data URLs
if (url.startsWith('data:')) {
chrome.runtime.sendMessage(
{ type: 'downloadDataUrl', dataUrl: url },
function (response) {
if (chrome.runtime.lastError) {
console.error(`Error sending data URL: ${chrome.runtime.lastError.message}`);
}
}
);
return;
}
let filename;
let absoluteUrl;
try {
// Resolve relative URLs (for example from data-src or srcset) against the page URL
const urlObj = new URL(url, location.href);
absoluteUrl = urlObj.href;
filename = urlObj.pathname.split('/').pop() || 'unnamed';
} catch (e) {
console.error(`Invalid URL: ${url}`);
return;
}
// Send message to background script to download the file
chrome.runtime.sendMessage(
{ type: 'download', url: absoluteUrl, filename: filename },
function (response) {
if (chrome.runtime.lastError) {
console.error(`Error sending message for ${filename}: ${chrome.runtime.lastError.message}`);
}
}
);
});
}
// Enhance MutationObserver to detect attribute changes and added nodes
const observer = new MutationObserver((mutations) => {
mutations.forEach((mutation) => {
if (mutation.type === 'childList') {
// Process added nodes
mutation.addedNodes.forEach((node) => {
processMediaElements(node);
});
} else if (mutation.type === 'attributes') {
processMediaElements(mutation.target);
}
});
});
// Start observing the document for changes
observer.observe(document, {
childList: true,
subtree: true,
attributes: true,
attributeFilter: [
'src',
'srcset',
'data-src',
'data-lazy-src',
'poster',
'style',
'data',
'href',
],
});
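// Initial processing (start at the root element; the document node itself is not an element)
processMediaElements(document.documentElement);
// Re-process periodically to catch any missed elements
setInterval(() => {
processMediaElements(document.documentElement);
}, 5000); // Adjust the interval as needed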
})();
// Variables to manage downloads and settings
let downloadQueue = [];
let activeDownloads = 0;
let maxConcurrentDownloads = 10;
let keepTrack = true;
let minFileSize = 50 * 1024;
let downloadedFilenames = new Set();
let downloadedUrls = new Set();
// Load initial settings from storage
chrome.storage.local.get(
[
'threads',
'keepTrack',
'minFileSize',
'downloadedFilenames',
'downloadedUrls',
],
(result) => {
maxConcurrentDownloads = result.threads || 10;
keepTrack = result.keepTrack !== false;
minFileSize = (result.minFileSize || 50) * 1024;
if (result.downloadedFilenames && Array.isArray(result.downloadedFilenames)) {
downloadedFilenames = new Set(result.downloadedFilenames);
}
if (result.downloadedUrls && Array.isArray(result.downloadedUrls)) {
downloadedUrls = new Set(result.downloadedUrls);
}
}
);
// Listener for messages from content scripts
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
if (message.type === 'updateSettings') {
// Update settings
chrome.storage.local.get(['threads', 'keepTrack', 'minFileSize'], (result) => {
maxConcurrentDownloads = result.threads || 10;
keepTrack = result.keepTrack !== false;
minFileSize = (result.minFileSize || 50) * 1024;
console.log(
`Updated settings: maxConcurrentDownloads = ${maxConcurrentDownloads}, keepTrack = ${keepTrack}, minFileSize = ${minFileSize} bytes`
);
});
} else if (message.type === 'resetQueue') {
downloadQueue = [];
sendResponse({ status: 'success' });
} else if (message.type === 'clearHistory') {
downloadedFilenames.clear();
downloadedUrls.clear();
chrome.storage.local.set(
{ downloadedFilenames: [], downloadedUrls: [] },
() => {
sendResponse({ status: 'success' });
}
);
return true;
} else if (message.type === 'download') {
const url = message.url;
const filename = message.filename;
if (keepTrack && downloadedUrls.has(url)) {
console.log(`Skipping download for URL ${url} as it has already been downloaded.`);
sendResponse({ status: 'skipped' });
return;
}
downloadQueue.push({ url: url, filename: filename });
processQueue();
sendResponse({ status: 'queued' });
} else if (message.type === 'downloadDataUrl') {
const dataUrl = message.dataUrl;
const filename = `image_${Date.now()}.png`;
fetch(dataUrl)
.then((res) => res.blob())
.then((blob) => {
const url = URL.createObjectURL(blob);
downloadQueue.push({ url: url, filename: filename, revokeUrl: true });
processQueue();
sendResponse({ status: 'queued' });
})
.catch((error) => {
console.error(`Failed to download data URL: ${error}`);
sendResponse({ status: 'error', error: error.toString() });
});
return true;
}
});
// Function to process the download queue
function processQueue() {
while (activeDownloads < maxConcurrentDownloads && downloadQueue.length > 0) {
const item = downloadQueue.shift();
startDownload(item);
}
}
// Function to start a download
function startDownload(item) {
activeDownloads++;
const filename = item.filename || item.url.split('/').pop();
const fullFilename = `firefox/${sanitizeFilename(filename)}`;
chrome.downloads.download(
{
url: item.url,
filename: fullFilename,
saveAs: false,
conflictAction: 'overwrite',
},
(downloadId) => {
if (chrome.runtime.lastError) {
console.error(`Download failed for ${filename}: ${chrome.runtime.lastError.message}`);
activeDownloads--;
processQueue();
} else {
console.log(`Download started: ID = ${downloadId}, filename = ${filename}`);
// Listener for changes in the download state
function onChanged(delta) {
if (
delta.id === downloadId &&
delta.state &&
delta.state.current === 'complete'
) {
chrome.downloads.search({ id: downloadId }, function (items) {
if (items && items.length > 0) {
const downloadItem = items[0];
const fileSize = downloadItem.fileSize || downloadItem.totalBytes;
if (fileSize < minFileSize) {
chrome.downloads.removeFile(downloadId, function () {
if (chrome.runtime.lastError) {
console.error(`Failed to remove file: ${chrome.runtime.lastError.message}`);
} else {
console.log(
`Removed file ${filename} (size ${fileSize} bytes) because it is smaller than the minimum size (${minFileSize} bytes).`
);
}
chrome.downloads.erase({ id: downloadId });
});
} else {
if (keepTrack) {
downloadedFilenames.add(filename);
downloadedUrls.add(item.url);
chrome.storage.local.set({
downloadedFilenames: Array.from(downloadedFilenames),
downloadedUrls: Array.from(downloadedUrls),
});
}
}
}
});
chrome.downloads.onChanged.removeListener(onChanged);
activeDownloads--;
processQueue();
if (item.revokeUrl) {
URL.revokeObjectURL(item.url);
}
} else if (
delta.id === downloadId &&
delta.state &&
delta.state.current === 'interrupted'
) {
console.error(`Download interrupted for ${filename}`);
chrome.downloads.onChanged.removeListener(onChanged);
activeDownloads--;
processQueue();
if (item.revokeUrl) {
URL.revokeObjectURL(item.url);
}
}
}
chrome.downloads.onChanged.addListener(onChanged);
}
}
);
}
// Function to sanitize filenames
function sanitizeFilename(filename) {
return filename.replace(/[\\/:*?"<>|]/g, '_');
}
// Modify webRequest listener to capture media requests
chrome.webRequest.onCompleted.addListener(
(details) => {
const url = details.url;
// Check if the request is for an image or video
if (details.type === 'image' || details.type === 'media') {
if (keepTrack && downloadedUrls.has(url)) {
console.log(`Skipping download for URL ${url} as it has already been downloaded.`);
return;
}
// Extract filename from URL
let filename;
try {
const urlObj = new URL(url);
filename = urlObj.pathname.split('/').pop() || 'unnamed';
} catch (e) {
console.error(`Invalid URL: ${url}`);
return;
}
filename = sanitizeFilename(filename);
// Add the download request to the queue
downloadQueue.push({ url: url, filename: filename });
processQueue();
}
},
{ urls: ['<all_urls>'] },
[]
);
document.addEventListener('DOMContentLoaded', () => {
const threadsInput = document.getElementById('threads');
const keepTrackCheckbox = document.getElementById('keepTrack');
const minFileSizeInput = document.getElementById('minFileSize');
const saveButton = document.getElementById('save');
const resetButton = document.getElementById('reset');
const clearHistoryButton = document.getElementById('clearHistory');
// Load saved settings
chrome.storage.local.get(['threads', 'keepTrack', 'minFileSize'], (result) => {
threadsInput.value = result.threads || 10;
keepTrackCheckbox.checked = result.keepTrack !== false;
minFileSizeInput.value = result.minFileSize || 50;
});
// Save settings
saveButton.addEventListener('click', () => {
const threads = parseInt(threadsInput.value, 10) || 10;
const keepTrack = keepTrackCheckbox.checked;
const minFileSize = parseInt(minFileSizeInput.value, 10) || 50;
chrome.storage.local.set(
{ threads: threads, keepTrack: keepTrack, minFileSize: minFileSize },
() => {
alert('Settings saved.');
chrome.runtime.sendMessage({ type: 'updateSettings' });
}
);
});
// Reset download queue
resetButton.addEventListener('click', () => {
chrome.runtime.sendMessage({ type: 'resetQueue' }, (response) => {
if (response && response.status === 'success') {
alert('Download queue reset.');
} else {
alert('Failed to reset queue.');
}
});
});
// Clear download history
clearHistoryButton.addEventListener('click', () => {
chrome.runtime.sendMessage({ type: 'clearHistory' }, (response) => {
if (response && response.status === 'success') {
alert('Downloaded filenames history cleared.');
} else {
alert('Failed to clear history.');
}
});
});
});
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>nGeneAutomaticDownloader Options</title>
<style>
body { font-family: Arial, sans-serif; padding: 10px; }
label { display: block; margin-bottom: 5px; }
input[type="number"] { width: 50px; }
button { margin-top: 10px; margin-right: 10px; }
</style>
</head>
<body>
<h2>Extension Settings</h2>
<label>
Number of Concurrent Downloads:
<input type="number" id="threads" min="1" max="10">
</label>
<label>
Keep track of downloaded filenames:
<input type="checkbox" id="keepTrack">
</label>
<label>
Minimum File Size (KB):
<input type="number" id="minFileSize" min="0" step="1">
</label>
<button id="save">Save Settings</button>
<button id="reset">Reset Download Queue</button>
<button id="clearHistory">Clear Downloaded Filenames</button>
<script src="options.js"></script>
</body>
</html>
The manifest.json file serves as the blueprint for the Firefox extension, delineating its metadata, permissions, and the scripts it employs.
The extension requests the downloads, tabs, <all_urls>, storage, webRequest, and webRequestBlocking permissions. These permissions enable the extension to monitor and interact with web content, manage downloads, and store user settings.
The background.js script is specified under the background property, functioning as the extension's core controller for managing downloads and maintaining state.
The content.js script is injected into all webpages (<all_urls>) except for about:* and resource://*/* URLs. It operates at the document_idle stage, ensuring that the DOM is fully loaded before execution.
The browser_action section defines a toolbar button with a popup (options.html) and an icon. The options_ui section configures the settings page, allowing users to adjust preferences without opening a new tab.
The content.js script is responsible for identifying and extracting media elements (images and videos) from the webpages the user visits. Its implementation encompasses several key features:
The script scans the elements <img>, <video>, <audio>, <source>, <picture>, <object>, <embed>, and <canvas>. It also inspects CSS styles for background images and pseudo-elements (::before, ::after).
It extracts URLs from attributes such as src, data-src, srcset, and inline styles. It ensures that only unique and valid URLs are processed to prevent duplicate downloads.
The script captures images from <canvas> elements by converting their content to data URLs. If successful, it sends these data URLs to the background script for downloading.
Using a MutationObserver, the script monitors the DOM for changes, such as the addition of new media elements or modifications to existing ones. This ensures that dynamically loaded content is also captured and processed.
The script communicates with the background.js script via messages, requesting downloads and passing necessary information like URLs and filenames.
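The URL extraction described above can be illustrated in isolation. The following is a minimal sketch, not the extension's actual code: the helper names parseSrcset, extractCssUrls, and resolveUnique are hypothetical, and the base URL in the usage lines is a placeholder. It shows how srcset candidates and CSS url(...) references might be split out, resolved against the page URL, and de-duplicated.

```javascript
// Split a srcset value into candidate URLs, dropping width/density descriptors.
function parseSrcset(srcset) {
  return srcset
    .split(',')
    .map((entry) => entry.trim().split(/\s+/)[0])
    .filter((url) => url);
}

// Pull every url(...) reference out of a CSS value such as background-image.
function extractCssUrls(styleValue) {
  if (!styleValue || styleValue === 'none') return [];
  const pattern = /url\(\s*["']?([^"')]+)["']?\s*\)/g;
  return Array.from(styleValue.matchAll(pattern), (m) => m[1].trim());
}

// Resolve candidates against a base URL; drop duplicates and unparsable values.
function resolveUnique(candidates, baseUrl) {
  const seen = new Set();
  for (const candidate of candidates) {
    try {
      seen.add(new URL(candidate, baseUrl).href);
    } catch (e) {
      // Skip values that cannot be parsed as URLs.
    }
  }
  return Array.from(seen);
}

// Usage with placeholder values:
const candidates = [
  ...parseSrcset('small.png 1x, large.png 2x'),
  ...extractCssUrls('url("img/bg.png")'),
];
console.log(resolveUnique(candidates, 'https://example.com/gallery/index.html'));
```

Resolving to absolute URLs before de-duplication means that two different relative spellings of the same resource collapse into one entry.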
The background.js script orchestrates the downloading process, managing download queues, threads, and tracking mechanisms.
The script maintains a downloadQueue array to hold pending download requests. It processes this queue based on the number of active downloads and the user-defined maximum concurrent downloads (maxConcurrentDownloads).
The script counts downloads in progress with the activeDownloads variable. It ensures that no more than the specified number of concurrent downloads are active at any given time, thereby managing threading effectively.
Settings are loaded from chrome.storage.local. These settings (the concurrent download limit, duplicate tracking, and the minimum file size) control how the script processes download tasks.
Completed downloads are tracked with two Set objects: downloadedFilenames and downloadedUrls. These sets store the names and URLs of files that have already been downloaded, ensuring that the same file is not downloaded multiple times.
Downloads are initiated with the chrome.downloads.download API. The script listens for changes in the download state to determine when a download is complete or interrupted. Upon completion, it verifies the file size against the minimum threshold and removes files that do not meet this criterion. Successfully downloaded files are recorded in the tracking sets.
For data URLs (for example, those extracted from <canvas> elements), the script fetches the blob data and converts it into a downloadable URL. These URLs are then added to the download queue for processing.
The script registers a webRequest.onCompleted listener to capture media requests directly from network traffic. This complements the DOM-based detection in content.js, ensuring comprehensive media coverage.
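The minimum-size check can be sketched as two small pure functions. This is an illustrative sketch only: minSizeInBytes and shouldKeepFile are hypothetical names, and the 50 KB default simply mirrors the default used in the listing. Unlike a `||` fallback, this version treats an explicit 0 as a valid threshold (keep every file), which is an assumption about the desired behavior rather than what the listing does.

```javascript
const BYTES_PER_KB = 1024;

// Convert the stored setting (in KB) to bytes, falling back to a default
// only when the value is missing or not a non-negative number.
function minSizeInBytes(minFileSizeKb, defaultKb = 50) {
  const valid = Number.isFinite(minFileSizeKb) && minFileSizeKb >= 0;
  return (valid ? minFileSizeKb : defaultKb) * BYTES_PER_KB;
}

// A completed download is kept only if it reaches the threshold.
function shouldKeepFile(fileSizeBytes, minBytes) {
  return fileSizeBytes >= minBytes;
}

// Usage: a 10 KB thumbnail is rejected under the default 50 KB threshold.
const threshold = minSizeInBytes(undefined);
console.log(shouldKeepFile(10 * BYTES_PER_KB, threshold));
```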
The options.html and options.js files collectively provide a user interface for configuring the extension's settings.
When the user saves, the settings are written to chrome.storage.local. The background script is notified to update its configuration accordingly.
Thread management is pivotal in handling multiple downloads efficiently without overwhelming system resources.
The background script loads the user-defined limit on concurrent downloads (maxConcurrentDownloads) from storage.
The processQueue function oversees the download queue, initiating downloads as long as the number of active downloads is below the specified limit. This function is invoked whenever a new download is added to the queue or when an active download completes.
Each download increments the activeDownloads count upon initiation. Listeners monitor the download's progress, decrementing the count and triggering the processing of subsequent queued downloads upon completion or interruption.
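The queueing behavior described above can be modeled without any browser APIs. The following is a minimal sketch under stated assumptions: createDownloadQueue and its start callback are hypothetical stand-ins for processQueue, startDownload, and the chrome.downloads listeners, and the settled flag is an added safeguard so that a download reporting both completion and interruption cannot decrement the counter twice.

```javascript
// Bounded-concurrency queue: at most maxConcurrent items are in flight.
// start(item, finish) begins one item and must call finish() exactly once
// when the item completes or is interrupted.
function createDownloadQueue(maxConcurrent, start) {
  const pending = [];
  let active = 0;

  function pump() {
    while (active < maxConcurrent && pending.length > 0) {
      const item = pending.shift();
      active++;
      let settled = false;
      start(item, () => {
        if (settled) return; // Ignore duplicate completion signals.
        settled = true;
        active--;
        pump(); // A slot is free: start the next queued item.
      });
    }
  }

  return {
    enqueue(item) {
      pending.push(item);
      pump();
    },
    get active() {
      return active;
    },
    get pending() {
      return pending.length;
    },
  };
}

// Usage: simulate downloads that finish after a short delay.
const queue = createDownloadQueue(2, (item, finish) => {
  setTimeout(finish, 10);
});
['a.png', 'b.png', 'c.png'].forEach((name) => queue.enqueue(name));
console.log(queue.active, queue.pending);
```

Note that JavaScript in an extension background page runs on a single thread; the limit here caps how many downloads are in flight at once rather than managing operating-system threads.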
To avoid redundant downloads and optimize performance, the extension implements a robust tracking system.
Two Set objects, downloadedFilenames and downloadedUrls, store the names and URLs of files that have been successfully downloaded.
Before queuing a download, the extension checks whether the URL exists in the downloadedUrls set. If it does, the download is skipped to prevent duplication.
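A compact model of this tracking logic is sketched below. It is illustrative rather than the extension's code: createTracker is a hypothetical helper, and the plain object passed to it stands in for what chrome.storage.local would return. The sets are converted to arrays in snapshot() because extension storage is generally limited to JSON-like values, which is also why the listing calls Array.from before saving.

```javascript
// Track downloaded URLs and filenames, restoring from a previously stored
// snapshot (arrays) and exposing a snapshot suitable for persisting again.
function createTracker(stored = {}) {
  const urls = new Set(
    Array.isArray(stored.downloadedUrls) ? stored.downloadedUrls : []
  );
  const filenames = new Set(
    Array.isArray(stored.downloadedFilenames) ? stored.downloadedFilenames : []
  );
  return {
    // True when the URL has not been recorded yet.
    shouldDownload(url) {
      return !urls.has(url);
    },
    // Record a successfully completed download.
    record(url, filename) {
      urls.add(url);
      filenames.add(filename);
    },
    // Arrays that can be written back to storage.
    snapshot() {
      return {
        downloadedUrls: Array.from(urls),
        downloadedFilenames: Array.from(filenames),
      };
    },
  };
}

// Usage with a placeholder URL:
const tracker = createTracker();
tracker.record('https://example.com/a.png', 'a.png');
console.log(tracker.shouldDownload('https://example.com/a.png'));
```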
The extension is tailored to download specific media file types, primarily images and videos.
Both the content script (content.js) and the background script (background.js) focus on media elements and web requests associated with images (<img>, background images) and videos (<video>, <audio>, <source>, etc.).
For <canvas> elements, the content script extracts the data and forwards it to the background script for conversion and downloading.
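To make the conversion step concrete, the sketch below decodes a base64 data URL by hand. It is an alternative illustration, not the extension's approach: the listing uses fetch(dataUrl) followed by res.blob(), whereas the hypothetical helpers decodeBase64DataUrl and filenameFor here expose the MIME type and bytes directly. Deriving the extension from the MIME type is an assumption added for illustration; the listing always uses a .png name.

```javascript
// Decode a base64 data URL into its MIME type and raw bytes.
function decodeBase64DataUrl(dataUrl) {
  const match = /^data:([^;,]+);base64,(.*)$/.exec(dataUrl);
  if (!match) {
    throw new Error('Expected a base64-encoded data URL');
  }
  const binary = atob(match[2]); // One character per decoded byte.
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return { mimeType: match[1], bytes };
}

// Build a filename whose extension follows the MIME subtype,
// for example "image/svg+xml" gives ".svg".
function filenameFor(mimeType, timestamp) {
  const subtype = mimeType.split('/')[1] || 'bin';
  const extension = subtype.split('+')[0];
  return `image_${timestamp}.${extension}`;
}

// Usage: canvas.toDataURL() produces a PNG data URL by default.
const decoded = decodeBase64DataUrl('data:text/plain;base64,aGVsbG8=');
console.log(decoded.mimeType, decoded.bytes.length);
console.log(filenameFor('image/png', Date.now()));
```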
The extension's user interface is designed for simplicity and ease of use.
The options.html file employs basic CSS to structure the settings form, ensuring that users can intuitively navigate and adjust preferences.
Written on November 30th, 2024