URL Crawling and Processing
The crawling system handles content extraction from URLs, sitemaps, and web pages in order to generate embeddings for semantic search.
Technologies Used
Crawl4AI
- Purpose: Advanced crawling and web content extraction
- Features: JavaScript rendering, clean content extraction, dynamic page handling
- Configuration: Rate limiting, user agents, timeouts
Node HTML Markdown
- Purpose: HTML-to-Markdown conversion
- Advantages: Preserves document structure, strips irrelevant tags
- Output: Clean, formatted text
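A minimal conversion sketch using the node-html-markdown package (the exact options used by the service are not shown in this document, so defaults are assumed):
import { NodeHtmlMarkdown } from 'node-html-markdown'

// Translate raw HTML into Markdown with default options.
const htmlToMarkdown = (html: string): string => NodeHtmlMarkdown.translate(html)

// Example: produces roughly "# Title\n\nHello **world**"
console.log(htmlToMarkdown('<h1>Title</h1><p>Hello <b>world</b></p>'))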
Supported Crawling Types
1. Single URL
- Use: Individual web pages
- Process: Direct crawl, content extraction
- Metadata: URL, title, description, timestamp
2. Multiple URLs
- Use: An explicit list of URLs
- Process: Batch crawling, parallel processing
- Metadata: Source URL, position in the list
3. XML Sitemap
- Use: Automatic crawling of an entire site
- Process: Sitemap parsing, URL extraction, batch crawling
- Metadata: Source sitemap, crawl depth
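The three modes above can be modeled as a discriminated union; the type and field names in this sketch are illustrative assumptions, not the project's actual definitions:
// Hypothetical request shapes for the three crawl modes described above.
type CrawlRequest =
  | { type: 'single'; url: string }
  | { type: 'multiple'; urls: string[] }
  | { type: 'sitemap'; sitemapUrl: string; maxDepth?: number }

// Example values for each mode.
const examples: CrawlRequest[] = [
  { type: 'single', url: 'https://example.com/article' },
  { type: 'multiple', urls: ['https://example.com/a', 'https://example.com/b'] },
  { type: 'sitemap', sitemapUrl: 'https://example.com/sitemap.xml', maxDepth: 1 },
]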
System Architecture
Crawl4AI Service
/src/services/crawl4AIService.ts
// Crawl4AI configuration
const crawl4AIConfig = {
  wordCountThreshold: 10,
  maxRetries: 3,
  delay: 1000,
  userAgent: 'Tidiko-AI-Crawler/1.0',
  timeout: 30000,
}
Crawling Process
/src/services/crawl4AIService.ts
// Main crawl entry point
const crawlUrl = async (url: string, config: CrawlConfig): Promise<CrawlResult> => {
  const crawler = new Crawl4AI(config)
  const result = await crawler.crawl(url)
  return processCrawlResult(result)
}
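A brief usage sketch combining crawlUrl with the crawl4AIConfig shown above (the Crawl4AI client interface follows the shape used in this document, not necessarily the upstream library's API):
// Crawl a single page with the service-level defaults; errors propagate
// to the caller, which applies the retry policy described further below.
const result = await crawlUrl('https://example.com/docs/intro', crawl4AIConfig)
console.log(result.content.slice(0, 200)) // preview of the extracted content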
Crawling Configuration
Rate Limiting
/src/services/crawl4AIService.ts
// Rate limiting configuration
const RATE_LIMITS = {
  requestsPerMinute: 60,
  delayBetweenRequests: 1000, // ms
  maxConcurrentRequests: 5,
}
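A small sketch of how delayBetweenRequests could be enforced with an async pause (the helper name is an assumption; the per-site limiter shown later handles the host-level check):
// Wait the configured delay between consecutive requests.
const waitBetweenRequests = (): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, RATE_LIMITS.delayBetweenRequests))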
User Agent and Headers
/src/services/crawl4AIService.ts
// Header configuration
const CRAWLER_HEADERS = {
  'User-Agent': 'Tidiko-AI-Crawler/1.0 (https://tidiko.ai)',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.5',
  'Accept-Encoding': 'gzip, deflate',
  'Connection': 'keep-alive',
}
Processing Pipeline
1. URL Validation
/src/services/embeddingsService.ts
// URL validation and normalization
const validateAndNormalizeUrl = (url: string): string => {
  try {
    const normalizedUrl = normalizeUrl(url)
    // Only HTTP(S) protocols are supported
    if (!normalizedUrl.startsWith('http://') && !normalizedUrl.startsWith('https://')) {
      throw new Error('Only HTTP and HTTPS URLs are supported')
    }
    return normalizedUrl
  } catch (error) {
    throw new Error(`Invalid URL: ${error.message}`)
  }
}
2. Content Crawling
/src/services/embeddingsService.ts
// Crawl step
const processUrl = async (url: string, type: UrlType, ctx: Context): Promise<LoaderResult[]> => {
  const crawlResult = await crawl4AIService.crawl(url, {
    wordCountThreshold: 10,
    maxRetries: 3,
    delay: 1000,
  })
  return processCrawlResult(crawlResult, url, type)
}
3. Content Extraction and Cleaning
/src/services/crawl4AIService.ts
// Extract clean content
const extractCleanContent = (html: string): string => {
  // Convert HTML to Markdown
  const markdown = convertHtmlToMarkdown(html)
  // Clean up the Markdown
  const cleaned = cleanMarkdown(markdown)
  return cleaned
}
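cleanMarkdown is referenced but not shown; a minimal sketch of what such a cleanup step might do (the rules below are assumptions, not the service's actual implementation):
// Illustrative cleanup: trim trailing whitespace, drop image-only lines,
// and collapse runs of blank lines.
const cleanMarkdown = (markdown: string): string =>
  markdown
    .replace(/[ \t]+$/gm, '')                 // trailing whitespace
    .replace(/^!\[[^\]]*\]\([^)]*\)$/gm, '')  // image-only lines
    .replace(/\n{3,}/g, '\n\n')               // collapse 3+ newlines
    .trim()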
4. Chunking
/src/services/jinaSimpleEmbeddingService.ts
// Split web content into chunks.
// RecursiveCharacterTextSplitter comes from LangChain; its splitText returns a Promise.
const splitWebContent = async (content: string): Promise<string[]> => {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    separators: ['\n\n', '\n', '. ', '! ', '? ', ' ', ''],
  })
  return splitter.splitText(content)
}
Sitemap Handling
XML Sitemap Parsing
/src/services/embeddingsService.ts
// Parse an XML sitemap.
// parseStringPromise from xml2js is used here, since the callback-based parseString cannot be awaited.
const parseSitemap = async (sitemapUrl: string): Promise<string[]> => {
  const response = await axios.get(sitemapUrl)
  const sitemap = await parseStringPromise(response.data)
  const urls: string[] = []
  // Extract URLs from the <urlset> entries
  if (sitemap.urlset?.url) {
    sitemap.urlset.url.forEach((url: any) => {
      if (url.loc && Array.isArray(url.loc)) {
        urls.push(url.loc[0])
      }
    })
  }
  return urls
}
Batch Crawling from a Sitemap
/src/services/embeddingsService.ts
// Batch crawl driven by a sitemap
const crawlSitemap = async (sitemapUrl: string, ctx: Context): Promise<LoaderResult[]> => {
  const urls = await parseSitemap(sitemapUrl)
  const results: LoaderResult[] = []
  // Batch crawl with rate limiting
  for (const url of urls) {
    try {
      const result = await processUrl(url, 'sitemap', ctx)
      results.push(...result)
      // Rate limiting: pause between requests
      await new Promise(resolve => setTimeout(resolve, 1000))
    } catch (error) {
      logError(ctx, 'Sitemap crawling failed', { url, error: error.message })
    }
  }
  return results
}
Web Metadata
URL Metadata
/src/types/crawl4aiTypes.ts
interface UrlMetadata {
  url: string
  title?: string
  description?: string
  author?: string
  published_date?: string
  last_modified?: string
  language?: string
  category?: string
  tags?: string[]
}
Crawling Metadata
/src/types/crawl4aiTypes.ts
interface CrawlingMetadata {
  crawl_date: string
  crawl_type: 'single' | 'multiple' | 'sitemap'
  sitemap_url?: string
  depth_level?: number
  parent_url?: string
  response_time: number
  status_code: number
}
Error Handling
Crawling Errors
/src/services/crawl4AIService.ts
// Handling of specific crawl errors.
// result is declared outside the try block so it is still in scope inside catch.
let result: CrawlResult | undefined
try {
  result = await crawlUrl(url, crawl4AIConfig)
} catch (error) {
  if (error.code === 'TIMEOUT') {
    logError(ctx, 'Crawling timeout', { url, timeout: 30000 })
  } else if (error.code === 'BLOCKED') {
    logError(ctx, 'Crawling blocked', { url, userAgent: CRAWLER_HEADERS['User-Agent'] })
  } else if (error.code === 'INVALID_CONTENT') {
    logError(ctx, 'Invalid content extracted', { url, contentLength: result?.content?.length })
  }
  throw new Error(`Crawling failed for ${url}: ${error.message}`)
}
Retry Logic
/src/services/crawl4AIService.ts
// Retry configuration for crawling
const CRAWLING_RETRY_CONFIG = {
  maxRetries: 3,
  backoff: {
    initialDelay: 1000,
    maxDelay: 10000,
    factor: 2,
  },
}
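A minimal sketch of how this configuration could drive an exponential backoff loop (the helper name and wiring are assumptions; the actual retry implementation is not shown in this document):
// Retry a crawl with exponential backoff, capped at backoff.maxDelay.
const crawlWithRetry = async (url: string): Promise<CrawlResult> => {
  let delay = CRAWLING_RETRY_CONFIG.backoff.initialDelay
  let lastError: unknown
  for (let attempt = 0; attempt <= CRAWLING_RETRY_CONFIG.maxRetries; attempt++) {
    try {
      return await crawlUrl(url, crawl4AIConfig)
    } catch (error) {
      lastError = error
      // Wait before the next attempt, increasing the delay each time.
      await new Promise(resolve => setTimeout(resolve, delay))
      delay = Math.min(delay * CRAWLING_RETRY_CONFIG.backoff.factor, CRAWLING_RETRY_CONFIG.backoff.maxDelay)
    }
  }
  throw new Error(`Crawling failed after ${CRAWLING_RETRY_CONFIG.maxRetries + 1} attempts: ${String(lastError)}`)
}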
Performance and Optimization
Parallel Crawling
/src/services/embeddingsService.ts
// Parallel crawling with a concurrency limit.
// Semaphore is assumed to be a local utility exposing acquire()/release();
// ctx is passed in explicitly so processUrl can log against the request context.
const crawlMultipleUrls = async (
  urls: string[],
  ctx: Context,
  maxConcurrent: number = 5
): Promise<LoaderResult[]> => {
  const semaphore = new Semaphore(maxConcurrent)
  const crawlPromises = urls.map(async url => {
    await semaphore.acquire()
    try {
      return await processUrl(url, 'multiple', ctx)
    } finally {
      semaphore.release()
    }
  })
  const allResults = await Promise.all(crawlPromises)
  return allResults.flat()
}
Content Caching
/src/services/crawl4AIService.ts
// Cache for already-crawled content
const contentCache = new Map<string, CrawlResult>()
const getCachedContent = (url: string): CrawlResult | null => {
  return contentCache.get(url) || null
}
const setCachedContent = (url: string, content: CrawlResult): void => {
  contentCache.set(url, content)
}
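A short usage sketch of a cache-first lookup around the crawl call (how the cache is actually wired into the service is not shown here, so this is an assumption):
// Return cached content when available; otherwise crawl and populate the cache.
const crawlWithCache = async (url: string): Promise<CrawlResult> => {
  const cached = getCachedContent(url)
  if (cached) {
    return cached
  }
  const result = await crawlUrl(url, crawl4AIConfig)
  setCachedContent(url, result)
  return result
}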
Advanced Configuration
Content Filters
/src/services/crawl4AIService.ts
// Filters for web content
const CONTENT_FILTERS = {
  minWordCount: 10,
  maxWordCount: 10000,
  allowedDomains: ['example.com', 'subdomain.example.com'],
  blockedDomains: ['ads.example.com', 'tracking.example.com'],
  allowedPaths: ['/articles/', '/docs/'],
  blockedPaths: ['/admin/', '/private/', '/api/'],
}
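A minimal sketch of how such filters might be applied to a candidate URL (the function is illustrative; the document does not show the actual filter logic):
// Decide whether a URL passes the domain and path filters above.
const passesContentFilters = (url: string): boolean => {
  const { hostname, pathname } = new URL(url)
  if (CONTENT_FILTERS.blockedDomains.includes(hostname)) return false
  if (!CONTENT_FILTERS.allowedDomains.includes(hostname)) return false
  if (CONTENT_FILTERS.blockedPaths.some(p => pathname.startsWith(p))) return false
  return CONTENT_FILTERS.allowedPaths.some(p => pathname.startsWith(p))
}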
JavaScript Handling
/src/services/crawl4AIService.ts
// JavaScript rendering configuration
const JAVASCRIPT_CONFIG = {
  enableJavaScript: true,
  waitForSelector: '.content',
  waitTimeout: 5000,
  executeScripts: [
    'window.scrollTo(0, document.body.scrollHeight)',
    'document.querySelectorAll("[data-lazy]").forEach(el => el.click())',
  ],
}
Monitoring
Crawling Metrics
- URLs Crawled: Number of URLs processed
- Success Rate: Percentage of successful crawls
- Average Response Time: Mean response time per request
- Content Quality: Quality of the extracted content
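A sketch of how these metrics could be accumulated per crawl run (the interface and field names are assumptions, not the project's actual monitoring code):
// Illustrative per-run metrics accumulator.
interface CrawlMetrics {
  urlsCrawled: number
  successes: number
  totalResponseTimeMs: number
}

const recordCrawl = (metrics: CrawlMetrics, ok: boolean, responseTimeMs: number): void => {
  metrics.urlsCrawled += 1
  if (ok) metrics.successes += 1
  metrics.totalResponseTimeMs += responseTimeMs
}

// Derived values: success rate and average response time.
const successRate = (m: CrawlMetrics): number => (m.urlsCrawled ? m.successes / m.urlsCrawled : 0)
const avgResponseTime = (m: CrawlMetrics): number => (m.urlsCrawled ? m.totalResponseTimeMs / m.urlsCrawled : 0)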
Detailed Logging
/src/utilities/loggerUtility.ts
// Log each crawling phase
logInfo(ctx, 'Crawling started', {
  url,
  type: crawlType,
  timestamp: new Date().toISOString(),
})
logInfo(ctx, 'Content extracted', {
  url,
  contentLength: content.length,
  wordCount: content.split(' ').length,
})
Security and Compliance
Robots.txt Compliance
/src/services/crawl4AIService.ts
// Check robots.txt
const checkRobotsTxt = async (url: string): Promise<boolean> => {
  try {
    const robotsUrl = new URL('/robots.txt', url).toString()
    const response = await axios.get(robotsUrl)
    // robots.txt parsing implementation
    return isUrlAllowed(url, response.data)
  } catch (error) {
    return true // Default to allow if robots.txt is unavailable
  }
}
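isUrlAllowed is referenced but not shown; a deliberately simplified sketch of a robots.txt check (it only honors Disallow lines under User-agent: * and ignores Allow rules, wildcards, and crawl-delay, so it is an assumption rather than a full parser):
// Very small robots.txt check: collect Disallow paths for the "*" agent
// and reject URLs whose path starts with any of them.
const isUrlAllowed = (url: string, robotsTxt: string): boolean => {
  const { pathname } = new URL(url)
  const disallowed: string[] = []
  let appliesToUs = false
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim()
    if (/^user-agent:/i.test(line)) {
      appliesToUs = line.slice(line.indexOf(':') + 1).trim() === '*'
    } else if (appliesToUs && /^disallow:/i.test(line)) {
      const path = line.slice(line.indexOf(':') + 1).trim()
      if (path) disallowed.push(path)
    }
  }
  return !disallowed.some(prefix => pathname.startsWith(prefix))
}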
Respectful Rate Limiting
/src/services/crawl4AIService.ts
// Per-site respectful rate limiting
const siteRateLimits = new Map<string, number>()
const checkSiteRateLimit = (url: string): boolean => {
  const domain = new URL(url).hostname
  const lastRequest = siteRateLimits.get(domain) || 0
  const now = Date.now()
  if (now - lastRequest < 1000) {
    // at least 1 second between requests to the same host
    return false
  }
  siteRateLimits.set(domain, now)
  return true
}