
Crawling and URL Processing

The crawling system handles content extraction from URLs, sitemaps, and web pages, feeding embedding generation for semantic search.

Technologies Used

Crawl4AI

  • Purpose: Advanced crawling and web content extraction
  • Features: JavaScript rendering, clean content extraction, dynamic page handling
  • Configuration: Rate limiting, user agents, timeouts

Node HTML Markdown

  • Purpose: HTML to Markdown conversion
  • Benefits: Structure preservation, removal of irrelevant tags
  • Output: Clean, formatted text (see the usage sketch below)
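
A minimal sketch of the HTML-to-Markdown step using the node-html-markdown package; the convertHtmlToMarkdown wrapper mirrors the call used later on this page and is illustrative, not the project's actual implementation.

import { NodeHtmlMarkdown } from 'node-html-markdown'

// Convert raw HTML into Markdown; markup that carries no content is dropped
const convertHtmlToMarkdown = (html: string): string => {
  return NodeHtmlMarkdown.translate(html)
}

// Example: headings and paragraphs survive the conversion
const markdown = convertHtmlToMarkdown('<h1>Title</h1><p>Intro text</p>')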

Supported Crawling Types

1. Single URL

  • Use: Individual web pages
  • Process: Direct crawling and content extraction
  • Metadata: URL, title, description, timestamp

2. Multiple URLs

  • Use: A list of specific URLs
  • Process: Batch crawling with parallel processing
  • Metadata: Source URL, position in the list

3. XML Sitemap

  • Use: Automatic crawling of entire websites
  • Process: Sitemap parsing, URL extraction, batch crawling
  • Metadata: Source sitemap, crawl depth (a sketch of how the three modes might be modeled follows below)
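
The three modes map naturally onto a discriminated value. The UrlType union below matches the crawl_type values used later on this page, while the dispatch helper is a hypothetical illustration of how the modes might be routed to the functions described in the following sections.

// Assumed shape of the UrlType referenced by processUrl later on this page
type UrlType = 'single' | 'multiple' | 'sitemap'

// Hypothetical dispatcher routing each mode to the corresponding crawl path
const crawlByType = async (
  input: string | string[],
  type: UrlType,
  ctx: Context
): Promise<LoaderResult[]> => {
  switch (type) {
    case 'single':
      return processUrl(input as string, 'single', ctx)
    case 'multiple':
      return crawlMultipleUrls(input as string[], ctx)
    case 'sitemap':
      return crawlSitemap(input as string, ctx)
  }
}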

System Architecture

Crawl4AI Service

/src/services/crawl4AIService.ts
// Crawl4AI configuration
const crawl4AIConfig = {
  wordCountThreshold: 10,
  maxRetries: 3,
  delay: 1000,
  userAgent: 'Tidiko-AI-Crawler/1.0',
  timeout: 30000,
}

Crawling Process

/src/services/crawl4AIService.ts
// Main crawling routine
const crawlUrl = async (url: string, config: CrawlConfig): Promise<CrawlResult> => {
  const crawler = new Crawl4AI(config)
  const result = await crawler.crawl(url)
  return processCrawlResult(result)
}

Crawling Configuration

Rate Limiting

/src/services/crawl4AIService.ts
// Rate limiting configuration
const RATE_LIMITS = {
  requestsPerMinute: 60,
  delayBetweenRequests: 1000,
  maxConcurrentRequests: 5,
}

User Agent and Headers

/src/services/crawl4AIService.ts
// Header configuration
const CRAWLER_HEADERS = {
  'User-Agent': 'Tidiko-AI-Crawler/1.0 (https://tidiko.ai)',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.5',
  'Accept-Encoding': 'gzip, deflate',
  'Connection': 'keep-alive',
}

Processing Pipeline

1. URL Validation

/src/services/embeddingsService.ts
// URL validation and normalization
const validateAndNormalizeUrl = (url: string): string => {
  try {
    const normalizedUrl = normalizeUrl(url)

    // Check that the protocol is supported
    if (!normalizedUrl.startsWith('http://') && !normalizedUrl.startsWith('https://')) {
      throw new Error('Only HTTP and HTTPS URLs are supported')
    }

    return normalizedUrl
  } catch (error) {
    throw new Error(`Invalid URL: ${error.message}`)
  }
}

2. Content Crawling

/src/services/embeddingsService.ts
// Crawling step
const processUrl = async (url: string, type: UrlType, ctx: Context): Promise<LoaderResult[]> => {
  const crawlResult = await crawl4AIService.crawl(url, {
    wordCountThreshold: 10,
    maxRetries: 3,
    delay: 1000,
  })

  return processCrawlResult(crawlResult, url, type)
}

3. Content Extraction and Cleaning

/src/services/crawl4AIService.ts
// Clean content extraction
const extractCleanContent = (html: string): string => {
  // Convert HTML to Markdown
  const markdown = convertHtmlToMarkdown(html)

  // Clean up the Markdown
  const cleaned = cleanMarkdown(markdown)

  return cleaned
}

4. Chunk Splitting

/src/services/jinaSimpleEmbeddingService.ts
// Split web content into chunks
// Note: splitText returns a Promise in LangChain's text splitters
const splitWebContent = async (content: string): Promise<string[]> => {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    separators: ['\n\n', '\n', '. ', '! ', '? ', ' ', ''],
  })

  return splitter.splitText(content)
}
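
Taken together, the four steps form a single ingestion path. The sketch below is a hypothetical end-to-end flow under the signatures shown above; the ingestUrl name and the html field on the crawl result are assumptions, not part of the codebase.

// Hypothetical end-to-end flow: validate, crawl, clean, and chunk one URL
const ingestUrl = async (rawUrl: string): Promise<string[]> => {
  const url = validateAndNormalizeUrl(rawUrl)                          // 1. validation
  const crawlResult = await crawl4AIService.crawl(url, crawl4AIConfig) // 2. crawling
  const content = extractCleanContent(crawlResult.html)                // 3. extraction and cleaning
  return splitWebContent(content)                                      // 4. chunk splitting
}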

Sitemap Handling

XML Sitemap Parsing

/src/services/embeddingsService.ts
// XML sitemap parsing
const parseSitemap = async (sitemapUrl: string): Promise<string[]> => {
  const response = await axios.get(sitemapUrl)
  // Assuming xml2js, whose promise-based parser is parseStringPromise
  const sitemap = await parseStringPromise(response.data)

  const urls: string[] = []

  // Extract URLs from the sitemap
  if (sitemap.urlset?.url) {
    sitemap.urlset.url.forEach((url: any) => {
      if (url.loc && Array.isArray(url.loc)) {
        urls.push(url.loc[0])
      }
    })
  }

  return urls
}

Batch Crawling from a Sitemap

/src/services/embeddingsService.ts
// Batch crawling from a sitemap
const crawlSitemap = async (sitemapUrl: string, ctx: Context): Promise<LoaderResult[]> => {
  const urls = await parseSitemap(sitemapUrl)
  const results: LoaderResult[] = []

  // Batch crawling with rate limiting
  for (const url of urls) {
    try {
      const result = await processUrl(url, 'sitemap', ctx)
      results.push(...result)

      // Rate limiting between requests
      await new Promise(resolve => setTimeout(resolve, 1000))
    } catch (error) {
      logError(ctx, 'Sitemap crawling failed', { url, error: error.message })
    }
  }

  return results
}

Web Metadata

URL Metadata

/src/types/crawl4aiTypes.ts
interface UrlMetadata {
  url: string
  title?: string
  description?: string
  author?: string
  published_date?: string
  last_modified?: string
  language?: string
  category?: string
  tags?: string[]
}

Crawling Metadata

/src/types/crawl4aiTypes.ts
interface CrawlingMetadata {
  crawl_date: string
  crawl_type: 'single' | 'multiple' | 'sitemap'
  sitemap_url?: string
  depth_level?: number
  parent_url?: string
  response_time: number
  status_code: number
}
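
For illustration, a crawl result could be annotated as follows; the buildCrawlingMetadata helper is hypothetical, and only the CrawlingMetadata interface above comes from the codebase.

// Hypothetical helper that fills CrawlingMetadata once a crawl completes
const buildCrawlingMetadata = (
  type: 'single' | 'multiple' | 'sitemap',
  responseTimeMs: number,
  statusCode: number,
  sitemapUrl?: string
): CrawlingMetadata => ({
  crawl_date: new Date().toISOString(),
  crawl_type: type,
  sitemap_url: sitemapUrl,
  response_time: responseTimeMs,
  status_code: statusCode,
})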

Error Handling

Crawling Errors

/src/services/crawl4AIService.ts
// Handling specific crawling errors
let result: CrawlResult | undefined

try {
  result = await crawlUrl(url, crawl4AIConfig)
} catch (error) {
  if (error.code === 'TIMEOUT') {
    logError(ctx, 'Crawling timeout', { url, timeout: 30000 })
  } else if (error.code === 'BLOCKED') {
    logError(ctx, 'Crawling blocked', { url, userAgent: CRAWLER_HEADERS['User-Agent'] })
  } else if (error.code === 'INVALID_CONTENT') {
    logError(ctx, 'Invalid content extracted', { url, contentLength: result?.content?.length })
  }

  throw new Error(`Crawling failed for ${url}: ${error.message}`)
}

Retry Logic

/src/services/crawl4AIService.ts
// Retry configuration for crawling
const CRAWLING_RETRY_CONFIG = {
  maxRetries: 3,
  backoff: {
    initialDelay: 1000,
    maxDelay: 10000,
    factor: 2,
  },
}
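
The snippet above only shows the configuration; a minimal sketch of how it might be consumed, using a hypothetical exponential-backoff wrapper named withRetry, is shown here.

// Hypothetical retry wrapper applying CRAWLING_RETRY_CONFIG with exponential backoff
const withRetry = async <T>(operation: () => Promise<T>): Promise<T> => {
  const { maxRetries, backoff } = CRAWLING_RETRY_CONFIG
  let delay = backoff.initialDelay

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation()
    } catch (error) {
      if (attempt === maxRetries) throw error
      await new Promise(resolve => setTimeout(resolve, delay))
      delay = Math.min(delay * backoff.factor, backoff.maxDelay)
    }
  }
  throw new Error('Unreachable: retries exhausted')
}

// Usage: const result = await withRetry(() => crawlUrl(url, crawl4AIConfig))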

Performance and Optimization

Parallel Crawling

/src/services/embeddingsService.ts
// Parallel crawling with a concurrency limit
const crawlMultipleUrls = async (
  urls: string[],
  ctx: Context,
  maxConcurrent: number = 5
): Promise<LoaderResult[]> => {
  const semaphore = new Semaphore(maxConcurrent)

  const crawlPromises = urls.map(async url => {
    await semaphore.acquire()
    try {
      return await processUrl(url, 'multiple', ctx)
    } finally {
      semaphore.release()
    }
  })

  const allResults = await Promise.all(crawlPromises)
  return allResults.flat()
}

Content Caching

/src/services/crawl4AIService.ts
// Cache for already-crawled content
const contentCache = new Map<string, CrawlResult>()

const getCachedContent = (url: string): CrawlResult | null => {
  return contentCache.get(url) || null
}

const setCachedContent = (url: string, content: CrawlResult): void => {
  contentCache.set(url, content)
}
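
A cache-first lookup could be wired into the crawl path as in the sketch below; crawlWithCache is a hypothetical wrapper around the functions above.

// Hypothetical cache-first wrapper: reuse a previous result when one exists
const crawlWithCache = async (url: string, config: CrawlConfig): Promise<CrawlResult> => {
  const cached = getCachedContent(url)
  if (cached) {
    return cached
  }

  const result = await crawlUrl(url, config)
  setCachedContent(url, result)
  return result
}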

Advanced Configuration

Content Filters

/src/services/crawl4AIService.ts
// Filters for web content
const CONTENT_FILTERS = {
  minWordCount: 10,
  maxWordCount: 10000,
  allowedDomains: ['example.com', 'subdomain.example.com'],
  blockedDomains: ['ads.example.com', 'tracking.example.com'],
  allowedPaths: ['/articles/', '/docs/'],
  blockedPaths: ['/admin/', '/private/', '/api/'],
}
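
How these filters are enforced is not shown above; a sketch of a URL-level check against CONTENT_FILTERS, using a hypothetical isUrlAllowedByFilters helper, might look like this.

// Hypothetical URL filter enforcing the domain and path lists in CONTENT_FILTERS
const isUrlAllowedByFilters = (rawUrl: string): boolean => {
  const { hostname, pathname } = new URL(rawUrl)

  if (CONTENT_FILTERS.blockedDomains.includes(hostname)) return false
  if (!CONTENT_FILTERS.allowedDomains.includes(hostname)) return false

  if (CONTENT_FILTERS.blockedPaths.some(path => pathname.startsWith(path))) return false
  return CONTENT_FILTERS.allowedPaths.some(path => pathname.startsWith(path))
}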

JavaScript Handling

/src/services/crawl4AIService.ts
// JavaScript rendering configuration
const JAVASCRIPT_CONFIG = {
  enableJavaScript: true,
  waitForSelector: '.content',
  waitTimeout: 5000,
  executeScripts: [
    'window.scrollTo(0, document.body.scrollHeight)',
    'document.querySelectorAll("[data-lazy]").forEach(el => el.click())',
  ],
}

Monitoring

Crawling Metrics

  • URLs Crawled: Number of URLs processed
  • Success Rate: Percentage of successful crawls
  • Average Response Time: Mean response time per request
  • Content Quality: Quality of the extracted content (a tracking sketch follows below)
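
These metrics could be accumulated with a small in-memory tracker; the CrawlMetrics shape and the recordCrawl helper below are assumptions for illustration, not part of the codebase.

// Hypothetical in-memory tracker for the metrics listed above
interface CrawlMetrics {
  urlsCrawled: number
  successes: number
  totalResponseTimeMs: number
}

const metrics: CrawlMetrics = { urlsCrawled: 0, successes: 0, totalResponseTimeMs: 0 }

const recordCrawl = (success: boolean, responseTimeMs: number): void => {
  metrics.urlsCrawled += 1
  if (success) metrics.successes += 1
  metrics.totalResponseTimeMs += responseTimeMs
}

const successRate = (): number =>
  metrics.urlsCrawled === 0 ? 0 : metrics.successes / metrics.urlsCrawled

const averageResponseTime = (): number =>
  metrics.urlsCrawled === 0 ? 0 : metrics.totalResponseTimeMs / metrics.urlsCrawled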

Detailed Logging

/src/utilities/loggerUtility.ts
// Logging for each crawling phase
logInfo(ctx, 'Crawling started', {
  url,
  type: crawlType,
  timestamp: new Date().toISOString(),
})

logInfo(ctx, 'Content extracted', {
  url,
  contentLength: content.length,
  wordCount: content.split(' ').length,
})

Security and Compliance

Robots.txt Compliance

/src/services/crawl4AIService.ts
// robots.txt check
const checkRobotsTxt = async (url: string): Promise<boolean> => {
  try {
    const robotsUrl = new URL('/robots.txt', url).toString()
    const response = await axios.get(robotsUrl)
    // robots.txt parsing happens in isUrlAllowed (see the sketch below)
    return isUrlAllowed(url, response.data)
  } catch (error) {
    return true // Default to allow when robots.txt is unavailable
  }
}
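
isUrlAllowed is referenced above but not shown; a minimal sketch that only honors Disallow rules under "User-agent: *" (ignoring Allow precedence and Crawl-delay) could look like this.

// Hypothetical minimal robots.txt check: blocks paths matching Disallow rules for the wildcard user agent
const isUrlAllowed = (url: string, robotsTxt: string): boolean => {
  const pathname = new URL(url).pathname
  let appliesToAll = false

  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.trim()

    if (/^user-agent:/i.test(line)) {
      appliesToAll = line.split(':')[1].trim() === '*'
    } else if (appliesToAll && /^disallow:/i.test(line)) {
      const rule = line.split(':')[1].trim()
      if (rule && pathname.startsWith(rule)) {
        return false
      }
    }
  }

  return true
}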

Respectful Rate Limiting

/src/services/crawl4AIService.ts
// Per-site respectful rate limiting
const siteRateLimits = new Map<string, number>()

const checkSiteRateLimit = (url: string): boolean => {
  const domain = new URL(url).hostname
  const lastRequest = siteRateLimits.get(domain) || 0
  const now = Date.now()

  if (now - lastRequest < 1000) {
    // At least one second between requests to the same site
    return false
  }

  siteRateLimits.set(domain, now)
  return true
}