Document Processing

The document processing system handles the parsing, chunking, and processing of various file formats for embedding generation.

Supported Formats

Text Documents

  • TXT: Plain text files
  • CSV: Comma-separated tables
  • JSON: JSON data structures

Office Documents

  • PDF: PDF documents with text extraction
  • DOCX: Microsoft Word documents
  • XLSX: Excel spreadsheets

Web Documents

  • XML: Structured XML documents
  • HTML: Web pages (simplified parsing)

System Architecture

File Loader Pipeline

/src/services/documetsLoader/filesLoader.ts
// File loading entry point (signature only)
declare function processFiles(
  files: FileInput[],
  collectionName: string,
  resourceUuid: string,
  ctx: Context
): Promise<LoaderResult[]>

Main Components

  • Multer: File upload and handling
  • Document Loaders: Format-specific parsers
  • Text Splitters: Splitting into chunks
  • Metadata Extractors: Metadata extraction
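
To illustrate how these components fit together, here is a minimal sketch of the parse, split, and metadata stages chained in order. The types `SketchFile` and `SketchChunk` and the function `runPipeline` are hypothetical names for illustration, since the real `FileInput` and `LoaderResult` types are not shown here. The sketch is synchronous for brevity, whereas the real loaders are asynchronous.

```typescript
// Hypothetical, simplified stand-ins for the real FileInput / LoaderResult types.
interface SketchFile {
  name: string
  mimetype: string
  text: string // stands in for the uploaded buffer
}

interface SketchChunk {
  content: string
  metadata: {
    file_name: string
    file_type: string
    chunk_index: number
    total_chunks: number
  }
}

type Parser = (file: SketchFile) => string
type Splitter = (content: string) => string[]

// Runs the stages in order: parse -> split -> attach per-chunk metadata.
function runPipeline(file: SketchFile, parse: Parser, split: Splitter): SketchChunk[] {
  const content = parse(file)
  const pieces = split(content)
  return pieces.map((piece, index) => ({
    content: piece,
    metadata: {
      file_name: file.name,
      file_type: file.mimetype,
      chunk_index: index,
      total_chunks: pieces.length,
    },
  }))
}

// Usage with trivial stand-in stages.
const demoChunks = runPipeline(
  { name: 'notes.txt', mimetype: 'text/plain', text: 'first\n\nsecond' },
  file => file.text,
  content => content.split('\n\n'),
)
```

Passing the parser and splitter as arguments keeps each stage replaceable, which is one plausible way to swap in format-specific loaders.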

Format Processors

1. PDF Processing

/src/services/documetsLoader/filesLoader.ts
// PDF parser
import pdf from 'pdf-parse'

const processPDF = async (buffer: Buffer): Promise<string> => {
  // pdf-parse resolves to an object whose `text` field holds the extracted text
  const data = await pdf(buffer)
  return data.text
}

Features:

  • Text extraction from PDF
  • Handling of complex layouts
  • Support for protected (password-locked) PDFs

2. DOCX Processing

/src/services/documetsLoader/filesLoader.ts
// Word parser
import mammoth from 'mammoth'

const processDOCX = async (buffer: Buffer): Promise<string> => {
  // extractRawText returns plain text only; document formatting is discarded
  const result = await mammoth.extractRawText({ buffer })
  return result.value
}

Features:

  • Text extraction from Word documents
  • Paragraph breaks preserved (extractRawText drops other formatting)
  • Table and list content extracted as plain text

3. Excel Processing

/src/services/documetsLoader/filesLoader.ts
// Excel parser
import ExcelJS from 'exceljs'

const processXLSX = async (buffer: Buffer): Promise<string> => {
  const workbook = new ExcelJS.Workbook()
  await workbook.xlsx.load(buffer)

  let content = ''
  workbook.eachSheet(worksheet => {
    worksheet.eachRow(row => {
      // row.values is 1-based, so index 0 is dropped to avoid a leading tab
      const cells = Array.isArray(row.values) ? row.values.slice(1) : []
      content += cells.join('\t') + '\n'
    })
  })

  return content
}

Features:

  • Multiple sheet support
  • Tabular structure preserved
  • Empty cell handling
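
As a small illustration of empty cell handling, the sketch below converts one row to a tab-separated line. It assumes the row arrives as a 1-based, possibly sparse array, which is how ExcelJS is understood to expose `row.values`. The helper name `rowToTsv` is hypothetical.

```typescript
// Converts a 1-based, possibly sparse row array into one tab-separated line.
function rowToTsv(values: unknown[]): string {
  const cells: string[] = []
  // Start at 1: index 0 is unused in a 1-based row array.
  for (let i = 1; i < values.length; i++) {
    const value = values[i]
    // Empty cells (holes, null, undefined) become empty strings so columns stay aligned.
    cells.push(value === null || value === undefined ? '' : String(value))
  }
  return cells.join('\t')
}

// A row with values in columns A and C and an empty column B.
const sparseRow: unknown[] = []
sparseRow[1] = 'Alice'
sparseRow[3] = 42
const sparseLine = rowToTsv(sparseRow) // 'Alice', empty cell, '42' separated by tabs
```

Keeping an empty string for each missing cell means every value stays in its original column position.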

4. CSV Processing

/src/services/documetsLoader/filesLoader.ts
// CSV parser
import { csvParse } from 'd3-dsv'

const processCSV = async (content: string): Promise<string> => {
  // csvParse uses the first line as column names, so the header row
  // is not part of `data` and does not appear in the output
  const data = csvParse(content)
  return data.map(row => Object.values(row).join('\t')).join('\n')
}

Features:

  • Automatic header parsing
  • Custom separator handling
  • Conversion to tab-separated tabular format
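
The sketch below shows one way header retention and custom separators could be handled without a library. It is a deliberately small parser, not the d3-dsv implementation. It supports quoted fields and doubled quotes only, and the names `parseDelimited` and `toTsv` are hypothetical.

```typescript
// Minimal delimited-text parser: supports quoted fields and doubled quotes ("").
function parseDelimited(text: string, delimiter: string = ','): string[][] {
  const rows: string[][] = []
  let row: string[] = []
  let field = ''
  let inQuotes = false
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i)
    if (inQuotes) {
      if (ch === '"' && text.charAt(i + 1) === '"') {
        field += '"' // escaped quote inside a quoted field
        i++
      } else if (ch === '"') {
        inQuotes = false
      } else {
        field += ch
      }
    } else if (ch === '"') {
      inQuotes = true
    } else if (ch === delimiter) {
      row.push(field)
      field = ''
    } else if (ch === '\n') {
      row.push(field)
      rows.push(row)
      row = []
      field = ''
    } else if (ch !== '\r') {
      field += ch
    }
  }
  // Flush the last record when the text does not end with a newline.
  if (field !== '' || row.length > 0) {
    row.push(field)
    rows.push(row)
  }
  return rows
}

// Joins every row, header included, into tab-separated lines.
function toTsv(rows: string[][]): string {
  return rows.map(r => r.join('\t')).join('\n')
}

// Usage: a semicolon-separated file with a quoted field containing the separator.
const semicolonTsv = toTsv(parseDelimited('name;note\nAda;"a;b"\n', ';'))
```

Unlike the parser shown earlier, this version keeps the header as the first output line.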

5. JSON Processing

/src/services/documetsLoader/filesLoader.ts
// JSON parser
const processJSON = async (content: string): Promise<string> => {
  // Parse to validate the input, then re-serialize with 2-space indentation
  const data = JSON.parse(content)
  return JSON.stringify(data, null, 2)
}

Features:

  • Parsing of complex JSON structures
  • Readable formatting
  • Handling of arrays and nested objects

6. XML Processing

/src/services/documetsLoader/filesLoader.ts
// XML parser
import { parseStringPromise } from 'xml2js'

const processXML = async (content: string): Promise<string> => {
  // parseString is callback-based; parseStringPromise is the awaitable variant
  const result = await parseStringPromise(content)
  return JSON.stringify(result, null, 2)
}

Features:

  • Parsing XML into JavaScript objects
  • Attribute and namespace handling
  • Conversion to JSON format

Extracted Metadata

File Metadata

/src/types/embeddingsTypes.ts
interface FileMetadata {
  file_name: string
  file_type: string
  file_size: number
  upload_date: string
  resource_uuid: string
  collection_name: string
}

Content Metadata

/src/types/embeddingsTypes.ts
interface ContentMetadata {
  chunk_index: number
  total_chunks: number
  chunk_size: number
  page_number?: number
  sheet_name?: string
  section_title?: string
}

Processing Workflow

1. Upload and Validation

/src/services/tasksService.ts
// File validation
const validateFile = (file: Express.Multer.File): boolean => {
  const allowedTypes = [
    'application/pdf',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'text/plain',
    'text/csv',
    'application/json',
    'application/xml',
  ]

  return allowedTypes.includes(file.mimetype)
}

2. Format-Specific Parsing

/src/services/documetsLoader/filesLoader.ts
// Router by file type
const parseFile = async (file: FileInput): Promise<string> => {
  switch (file.mimetype) {
    case 'application/pdf':
      return await processPDF(file.buffer)
    case 'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
      return await processDOCX(file.buffer)
    case 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
      return await processXLSX(file.buffer)
    case 'text/csv':
      return await processCSV(file.buffer.toString())
    case 'application/json':
      return await processJSON(file.buffer.toString())
    case 'application/xml':
      return await processXML(file.buffer.toString())
    default:
      // Any other type, including text/plain, is read as plain text
      return file.buffer.toString()
  }
}
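
The same routing can also be expressed as a lookup table, which may be easier to extend than the switch above. The sketch below resolves a parser name from a MIME type and, as an assumption about possible client input, strips parameters such as `; charset=utf-8` before the lookup. The names `ParserName`, `PARSER_BY_MIME`, `normalizeMime`, and `resolveParser` are hypothetical.

```typescript
type ParserName = 'pdf' | 'docx' | 'xlsx' | 'csv' | 'json' | 'xml' | 'text'

// Same MIME types as the switch, expressed as data.
const PARSER_BY_MIME = new Map<string, ParserName>([
  ['application/pdf', 'pdf'],
  ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'docx'],
  ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'xlsx'],
  ['text/csv', 'csv'],
  ['application/json', 'json'],
  ['application/xml', 'xml'],
])

// Drops parameters such as "; charset=utf-8" and lowercases the type.
function normalizeMime(mimetype: string): string {
  const end = mimetype.indexOf(';')
  const base = end === -1 ? mimetype : mimetype.slice(0, end)
  return base.trim().toLowerCase()
}

// Falls back to plain-text handling for unknown types, like the default branch.
function resolveParser(mimetype: string): ParserName {
  return PARSER_BY_MIME.get(normalizeMime(mimetype)) ?? 'text'
}
```

In a real router the map values would be the parser functions themselves. Names are used here only to keep the sketch free of dependencies.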

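3. Splitting into Chunks

/src/services/jinaSimpleEmbeddingService.ts
// Optimized text splitting
// Assumes LangChain's splitter; the import path depends on the installed version
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters'

// splitText is asynchronous in LangChain, so this returns a Promise
const splitContent = (content: string): Promise<string[]> => {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    separators: ['\n\n', '\n', '. ', '! ', '? ', ' ', ''],
  })

  return splitter.splitText(content)
}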
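
To make the effect of `chunkSize` and `chunkOverlap` concrete, here is a simplified fixed-window splitter. It is not the recursive, separator-aware algorithm configured above. It only shows how consecutive chunks share their boundary characters. The function name `splitWithOverlap` is hypothetical.

```typescript
// Simplified fixed-window splitter: each chunk starts (chunkSize - chunkOverlap)
// characters after the previous one, so neighbours share chunkOverlap characters.
function splitWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be >= 0 and smaller than chunkSize')
  }
  const step = chunkSize - chunkOverlap
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break
  }
  return chunks
}

// With size 4 and overlap 2, each chunk repeats the last 2 characters of the previous one.
const overlapDemo = splitWithOverlap('abcdefghij', 4, 2)
```

With the configured values (1000 and 200), a window like this would advance 800 characters at a time.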

4. Metadata Generation

/src/services/documetsLoader/filesLoader.ts
// Per-chunk metadata
const generateChunkMetadata = (
  chunk: string,
  index: number,
  total: number,
  file: FileInput
): BaseQdrantMetadata<BaseQdrantMetadataType.DOCUMENT> => {
  return {
    source: file.originalname,
    type: BaseQdrantMetadataType.DOCUMENT,
    resource_uuid: file.resourceUuid,
    collection_name: file.collectionName,
    chunk_index: index,
    total_chunks: total,
    file_name: file.originalname,
    file_type: file.mimetype,
    chunk_size: chunk.length,
  }
}

Error Handling

Parsing Errors

/src/services/documetsLoader/filesLoader.ts
// Format-specific error handling
let content: string
try {
  content = await parseFile(file)
} catch (error) {
  // `error` is `unknown` in a catch clause, so extract the message safely
  const message = error instanceof Error ? error.message : String(error)
  if (error instanceof PDFParseError) {
    logError(ctx, 'PDF parsing failed', { error: message })
  } else if (error instanceof ExcelError) {
    logError(ctx, 'Excel parsing failed', { error: message })
  }
  throw new Error(`Failed to parse ${file.mimetype}: ${message}`)
}
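
One way to keep the original failure available to callers is to wrap it in a dedicated error type rather than a plain `Error`. The sketch below is self-contained and synchronous for brevity. `ParseFailure`, `describeError`, and `parseOrWrap` are hypothetical names and are not part of the codebase described here.

```typescript
// Hypothetical error type for this sketch. It carries the MIME type and the
// original failure so callers can inspect both.
class ParseFailure extends Error {
  readonly mimetype: string
  readonly original: unknown

  constructor(mimetype: string, original: unknown) {
    super(`Failed to parse ${mimetype}: ${describeError(original)}`)
    this.name = 'ParseFailure'
    this.mimetype = mimetype
    this.original = original
    // Keeps `instanceof` working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, ParseFailure.prototype)
  }
}

// A caught value is `unknown` in TypeScript, so narrow it before reading `.message`.
function describeError(error: unknown): string {
  return error instanceof Error ? error.message : String(error)
}

// Runs a parser and rethrows any failure with the MIME type attached.
function parseOrWrap<T>(mimetype: string, parse: () => T): T {
  try {
    return parse()
  } catch (error) {
    throw new ParseFailure(mimetype, error)
  }
}
```

A caller could then branch on `error instanceof ParseFailure` and log `error.mimetype` without parsing the message string.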

Validation Errors

/src/services/tasksService.ts
// File size validation
const MAX_FILE_SIZE = 50 * 1024 * 1024 // 50 MB

if (file.size > MAX_FILE_SIZE) {
  throw new Error('File size exceeds maximum allowed size')
}
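
The type check and the size check can be combined into a single function that reports why a file was rejected. The sketch below assumes the same allowed MIME types and 50 MB limit listed in this document. `UploadInfo`, `ValidationResult`, and `validateUpload` are hypothetical names.

```typescript
interface UploadInfo {
  mimetype: string
  size: number // bytes
}

type ValidationResult = { ok: true } | { ok: false; reason: string }

const ALLOWED_MIME_TYPES = [
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  'text/plain',
  'text/csv',
  'application/json',
  'application/xml',
]

const MAX_UPLOAD_BYTES = 50 * 1024 * 1024 // 50 MB

// Returns a result object instead of throwing, so the caller decides how to respond.
function validateUpload(file: UploadInfo): ValidationResult {
  if (ALLOWED_MIME_TYPES.indexOf(file.mimetype) === -1) {
    return { ok: false, reason: `Unsupported file type: ${file.mimetype}` }
  }
  if (file.size > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: 'File size exceeds maximum allowed size' }
  }
  return { ok: true }
}
```

Returning a reason lets an API handler turn each failure into a specific error response.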

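Performance and Optimization

Streaming for Large Files

/src/services/documetsLoader/filesLoader.ts
// Stream-based reading for large files
import fs from 'fs'

const processLargeFile = async (file: FileInput): Promise<LoaderResult[]> => {
  // Decoding as UTF-8 in the stream keeps multi-byte characters intact across reads
  const stream = fs.createReadStream(file.path, { encoding: 'utf8' })
  const chunks: string[] = []

  return new Promise((resolve, reject) => {
    stream.on('data', chunk => {
      chunks.push(chunk.toString())
    })

    // Without this handler a read failure would leave the promise pending
    stream.on('error', reject)

    stream.on('end', async () => {
      try {
        // Note: the full content is still assembled in memory before splitting
        const content = chunks.join('')
        const parts = await splitText(content)
        resolve(processChunks(parts, file))
      } catch (error) {
        reject(error)
      }
    })
  })
}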
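
If the goal is to avoid holding the whole file in memory, chunks can be emitted while the stream is still being read. The sketch below is one possible approach using fixed-size windows with overlap. It is not the recursive splitter used elsewhere in this document, and the class name `StreamingChunker` is hypothetical.

```typescript
// Emits fixed-size windows as text arrives, keeping only the unconsumed tail in memory.
class StreamingChunker {
  private buffer = ''
  private emittedAny = false
  private readonly chunkSize: number
  private readonly chunkOverlap: number

  constructor(chunkSize: number, chunkOverlap: number) {
    if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
      throw new Error('chunkOverlap must be >= 0 and smaller than chunkSize')
    }
    this.chunkSize = chunkSize
    this.chunkOverlap = chunkOverlap
  }

  // Feed one piece of the stream; returns any chunks that are now complete.
  push(text: string): string[] {
    this.buffer += text
    const step = this.chunkSize - this.chunkOverlap
    const out: string[] = []
    while (this.buffer.length >= this.chunkSize) {
      out.push(this.buffer.slice(0, this.chunkSize))
      // Keep the overlap so the next chunk repeats it.
      this.buffer = this.buffer.slice(step)
      this.emittedAny = true
    }
    return out
  }

  // Call once at end of stream; emits the tail only if it holds text not yet emitted.
  flush(): string[] {
    const rest = this.buffer
    const alreadyCovered = this.emittedAny ? this.chunkOverlap : 0
    this.buffer = ''
    return rest.length > alreadyCovered ? [rest] : []
  }
}
```

In a stream handler, `push` would be called from the `data` event and `flush` from the `end` event, so at most one window plus one read is buffered at a time.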

Metadata Caching

/src/services/documetsLoader/filesLoader.ts
// Cache for metadata of similar files, keyed by file hash
const metadataCache = new Map<string, FileMetadata>()

const getCachedMetadata = (fileHash: string): FileMetadata | null => {
  return metadataCache.get(fileHash) ?? null
}
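
A plain `Map` grows for as long as the process runs. As a sketch of one way to cap memory use, the class below evicts the least recently used entry once a size limit is reached. `BoundedCache` is a hypothetical name, and the eviction policy is an assumption rather than something the code above implements.

```typescript
// Small least-recently-used cache built on Map's insertion-order iteration.
class BoundedCache<V> {
  private readonly entries = new Map<string, V>()
  private readonly maxEntries: number

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries
  }

  get(key: string): V | null {
    if (!this.entries.has(key)) return null
    const value = this.entries.get(key) as V
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key)
    this.entries.set(key, value)
    return value
  }

  set(key: string, value: V): void {
    this.entries.delete(key)
    this.entries.set(key, value)
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
  }

  get size(): number {
    return this.entries.size
  }
}
```

The key would be the file hash, as in `getCachedMetadata`, and the value the cached metadata object.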

Configuration

File Limits

/src/services/tasksService.ts
// Limit configuration
const FILE_LIMITS = {
  maxSize: 50 * 1024 * 1024, // 50 MB
  maxChunks: 1000,
  chunkSize: 1000,
  chunkOverlap: 200,
}
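
These limits interact. As a rough estimate, assuming a plain fixed-window splitter and about one character per byte (neither is confirmed here), a window of 1000 characters with 200 of overlap advances 800 characters per chunk. Under those assumptions `maxChunks: 1000` is reached at about 800,200 characters, far below the 50 MB size limit. The helper name `estimateChunkCount` is hypothetical.

```typescript
// Estimates how many fixed-size windows are needed to cover `contentLength`
// characters when consecutive windows share `chunkOverlap` characters.
function estimateChunkCount(contentLength: number, chunkSize: number, chunkOverlap: number): number {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be >= 0 and smaller than chunkSize')
  }
  if (contentLength <= 0) return 0
  if (contentLength <= chunkSize) return 1
  const step = chunkSize - chunkOverlap
  // Smallest k such that (k - 1) * step + chunkSize >= contentLength.
  return Math.ceil((contentLength - chunkOverlap) / step)
}

// 800,200 characters fill exactly 1000 chunks with the configured values.
const chunksAtLimit = estimateChunkCount(800200, 1000, 200)
// A 50 MB file of single-byte characters would need far more than maxChunks.
const chunksAtMaxSize = estimateChunkCount(50 * 1024 * 1024, 1000, 200)
```

A separator-aware splitter produces chunks of varying length, so its real counts will differ from this estimate.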

Supported File Types

/src/services/tasksService.ts
// Supported MIME types
const SUPPORTED_TYPES = [
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  'text/plain',
  'text/csv',
  'application/json',
  'application/xml',
]

Monitoring

Key Metrics

  • File Processed: Number of files processed, by type
  • Parsing Success Rate: Percentage of files parsed successfully
  • Average Processing Time: Mean processing time per file
  • Chunk Generation: Number of chunks generated per file
Detailed Logging

/src/utilities/loggerUtility.ts
// Log each stage
logInfo(ctx, 'File processing started', {
  fileName: file.originalname,
  fileType: file.mimetype,
  fileSize: file.size,
})

logInfo(ctx, 'File parsed successfully', {
  contentLength: content.length,
  chunksGenerated: chunks.length,
})