Document Processing
The document processing system parses, chunks, and processes files in several formats so that embeddings can be generated from them.
Supported Formats
Text Documents
- TXT: Plain text files
- CSV: Comma-separated tables
- JSON: JSON data structures
Office Documents
- PDF: PDF documents with text extraction
- DOCX: Microsoft Word documents
- XLSX: Excel spreadsheets
Web Documents
- XML: Structured XML documents
- HTML: Web pages (simplified parsing)
System Architecture
File Loader Pipeline
/src/services/documetsLoader/filesLoader.ts
// File loading process (signature only)
const processFiles = async (
  files: FileInput[],
  collectionName: string,
  resourceUuid: string,
  ctx: Context
): Promise<LoaderResult[]>
Main Components
- Multer: File upload and handling
- Document Loaders: Format-specific parsers
- Text Splitters: Splitting into chunks
- Metadata Extractors: Metadata extraction
Format Processors
1. PDF Processing
/src/services/documetsLoader/filesLoader.ts
// PDF Parser
import pdf from 'pdf-parse'
const processPDF = async (buffer: Buffer): Promise<string> => {
  const data = await pdf(buffer)
  return data.text
}
Features:
- Text extraction from PDF
- Handling of complex layouts
- Support for password-protected PDFs
2. DOCX Processing
/src/services/documetsLoader/filesLoader.ts
// Word Parser
import mammoth from 'mammoth'
const processDOCX = async (buffer: Buffer): Promise<string> => {
  const result = await mammoth.extractRawText({ buffer })
  return result.value
}
Features:
- Text extraction from Word documents
- Preservation of basic formatting
- Handling of tables and lists
3. Excel Processing
/src/services/documetsLoader/filesLoader.ts
// Excel Parser
import ExcelJS from 'exceljs'
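const processXLSX = async (buffer: Buffer): Promise<string> => {
  const workbook = new ExcelJS.Workbook()
  await workbook.xlsx.load(buffer)
  let content = ''
  workbook.eachSheet(worksheet => {
    worksheet.eachRow(row => {
      // row.values is 1-based (index 0 is unused), so drop the first slot
      const cells = (row.values as unknown[]).slice(1)
      // Array.from turns the holes left by empty cells into empty strings
      content += Array.from(cells, cell => cell ?? '').join('\t') + '\n'
    })
  })
  return content
}
Features:
- Multi-sheet support
- Preservation of tabular structure
- Handling of empty cells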
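Spreadsheet cells are not always plain strings or numbers: formula, rich-text, and hyperlink cells are typically exposed as objects, which would print as `[object Object]` when joined directly. The sketch below shows one way such values might be flattened to text. `cellToText` and `rowToLine` are hypothetical helpers, and the object shapes they check are assumptions modeled on ExcelJS cell values, not code taken from `filesLoader.ts`.

```typescript
// Hypothetical helper: converts one spreadsheet cell value to plain text.
// The object shapes checked below are assumptions modeled on ExcelJS cell values.
const cellToText = (value: unknown): string => {
  if (value === null || value === undefined) return '' // empty cell
  if (value instanceof Date) return value.toISOString()
  if (typeof value === 'object') {
    const v = value as Record<string, unknown>
    if (Array.isArray(v.richText)) {
      // Rich text: concatenate the text of each run
      return v.richText.map(part => String((part as { text?: unknown }).text ?? '')).join('')
    }
    if ('result' in v) return cellToText(v.result) // formula cell: use the cached result
    if ('text' in v) return cellToText(v.text) // hyperlink cell: use the visible text
    if ('error' in v) return String(v.error) // error cell, e.g. '#N/A'
    return ''
  }
  return String(value)
}

// Joins one row of cell values into a tab-separated line.
const rowToLine = (cells: unknown[]): string => cells.map(cellToText).join('\t')
```

A row such as `['a', null, 3]` keeps its column positions because the empty cell becomes an empty string between two tabs.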
4. CSV Processing
/src/services/documetsLoader/filesLoader.ts
// CSV Parser
import { csvParse } from 'd3-dsv'
const processCSV = async (content: string): Promise<string> => {
  // csvParse uses the first row as column names; they remain available as data.columns
  const data = csvParse(content)
  return data.map(row => Object.values(row).join('\t')).join('\n')
}
Features:
- Automatic header parsing
- Handling of custom separators
- Conversion to tabular format
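As an illustration of custom separator handling, here is a minimal, dependency-free sketch that converts delimited text to tab-separated rows. `splitDelimitedLine` and `delimitedToTsv` are hypothetical names, not functions from the project, and the sketch assumes a single-character delimiter and no line breaks inside quoted fields.

```typescript
// Hypothetical helper: splits one delimited line, honoring double-quoted
// fields (a doubled quote "" inside a quoted field is an escaped quote).
const splitDelimitedLine = (line: string, delimiter: string): string[] => {
  const fields: string[] = []
  let current = ''
  let inQuotes = false
  for (let i = 0; i < line.length; i++) {
    const ch = line[i]
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'
        i++ // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false
      } else {
        current += ch
      }
    } else if (ch === '"') {
      inQuotes = true
    } else if (ch === delimiter) {
      fields.push(current)
      current = ''
    } else {
      current += ch
    }
  }
  fields.push(current)
  return fields
}

// Converts delimited text to tab-separated rows, keeping the header row.
const delimitedToTsv = (content: string, delimiter = ','): string =>
  content
    .split(/\r?\n/)
    .filter(line => line.length > 0)
    .map(line => splitDelimitedLine(line, delimiter).join('\t'))
    .join('\n')
```

For semicolon-separated input, `delimitedToTsv(text, ';')` would be the call; a delimiter that appears inside a quoted field is left untouched.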
5. JSON Processing
/src/services/documetsLoader/filesLoader.ts
// JSON Parser
const processJSON = async (content: string): Promise<string> => {
  const data = JSON.parse(content)
  return JSON.stringify(data, null, 2)
}
Features:
- Parsing of complex JSON structures
- Human-readable formatting
- Handling of arrays and nested objects
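Pretty-printed JSON spends many characters on braces and indentation. One possible alternative for nested structures, shown here only as a sketch, is to flatten each leaf value into a `path: value` line so that every chunk carries its own context. `flattenJson` and the `Json` type are hypothetical and not part of the loader; empty objects and arrays produce no lines in this version.

```typescript
// Hypothetical JSON value type used only by this sketch.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

// Hypothetical helper: flattens nested JSON into "path: value" lines.
const flattenJson = (value: Json, path = ''): string[] => {
  // Leaf values (including null) become a single line
  if (value === null || typeof value !== 'object') {
    return [`${path}: ${String(value)}`]
  }
  const lines: string[] = []
  if (Array.isArray(value)) {
    // Array items are addressed by index: tags[0], tags[1], ...
    value.forEach((item, i) => lines.push(...flattenJson(item, `${path}[${i}]`)))
  } else {
    // Object members are addressed with dot notation: user.name
    for (const [key, child] of Object.entries(value)) {
      lines.push(...flattenJson(child, path ? `${path}.${key}` : key))
    }
  }
  return lines
}
```

Calling `flattenJson(JSON.parse(content)).join('\n')` would yield one line per leaf value.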
6. XML Processing
/src/services/documetsLoader/filesLoader.ts
// XML Parser
import { parseStringPromise } from 'xml2js'
const processXML = async (content: string): Promise<string> => {
  // parseString is callback-based; parseStringPromise is the awaitable variant
  const result = await parseStringPromise(content)
  return JSON.stringify(result, null, 2)
}
Features:
- Parsing of XML into JavaScript objects
- Handling of attributes and namespaces
- Conversion to JSON format
Extracted Metadata
File Metadata
/src/types/embeddingsTypes.ts
interface FileMetadata {
  file_name: string
  file_type: string
  file_size: number
  upload_date: string
  resource_uuid: string
  collection_name: string
}
Content Metadata
/src/types/embeddingsTypes.ts
interface ContentMetadata {
  chunk_index: number
  total_chunks: number
  chunk_size: number
  page_number?: number
  sheet_name?: string
  section_title?: string
}
Processing Flow
1. Upload and Validation
/src/services/tasksService.ts
// File validation
const validateFile = (file: Express.Multer.File): boolean => {
  const allowedTypes = [
    'application/pdf',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'text/plain',
    'text/csv',
    'application/json',
    'application/xml',
  ]
  return allowedTypes.includes(file.mimetype)
}
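Upload clients do not always report a precise MIME type: some send a generic `application/octet-stream`, and some append parameters such as `; charset=utf-8`. The sketch below shows one way the reported type might be normalized before it is checked against the allow-list. `resolveMimeType` and `EXTENSION_TO_MIME` are hypothetical names, and falling back to the file extension is an assumption of this example, not documented behavior of the service.

```typescript
// Hypothetical extension-to-MIME table covering the allow-listed types.
const EXTENSION_TO_MIME: Record<string, string> = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  txt: 'text/plain',
  csv: 'text/csv',
  json: 'application/json',
  xml: 'application/xml',
}

// Hypothetical helper: normalizes the reported MIME type and, when it is
// missing or generic, falls back to the file extension.
const resolveMimeType = (mimetype: string, originalname: string): string => {
  // Drop parameters such as "; charset=utf-8" and normalize case
  const base = mimetype.split(';')[0].trim().toLowerCase()
  if (base !== '' && base !== 'application/octet-stream') return base
  const ext = originalname.split('.').pop()?.toLowerCase() ?? ''
  return EXTENSION_TO_MIME[ext] ?? base
}
```

An extension is easy to spoof, so a fallback like this is a convenience for well-behaved clients rather than a security check.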
2. Format-Specific Parsing
/src/services/documetsLoader/filesLoader.ts
// Route by file type
const parseFile = async (file: FileInput): Promise<string> => {
  switch (file.mimetype) {
    case 'application/pdf':
      return await processPDF(file.buffer)
    case 'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
      return await processDOCX(file.buffer)
    case 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
      return await processXLSX(file.buffer)
    case 'text/csv':
      return await processCSV(file.buffer.toString())
    case 'application/json':
      return await processJSON(file.buffer.toString())
    case 'application/xml':
      return await processXML(file.buffer.toString())
    default:
      // Any other type (e.g. text/plain) is treated as UTF-8 text
      return file.buffer.toString()
  }
}
3. Chunk Splitting
/src/services/jinaSimpleEmbeddingService.ts
// Optimized text splitting
const splitContent = async (content: string): Promise<string[]> => {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    separators: ['\n\n', '\n', '. ', '! ', '? ', ' ', ''],
  })
  // Assuming LangChain's RecursiveCharacterTextSplitter, splitText returns a Promise
  return splitter.splitText(content)
}
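To make the roles of `chunkSize` and `chunkOverlap` concrete, here is a deliberately simplified sketch that cuts fixed-size windows, each starting `chunkSize - chunkOverlap` characters after the previous one. `splitWithOverlap` is a hypothetical function: unlike a recursive splitter it ignores the separator hierarchy and may cut in the middle of a word.

```typescript
// Hypothetical, simplified splitter: fixed-size windows with overlap.
// It does NOT try paragraph, line, or sentence separators first.
const splitWithOverlap = (text: string, chunkSize = 1000, chunkOverlap = 200): string[] => {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize')
  }
  const step = chunkSize - chunkOverlap // distance between window starts
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    if (start + chunkSize >= text.length) break // this window reached the end
  }
  return chunks
}
```

With `chunkSize` 4 and `chunkOverlap` 1, the text `abcdefghij` splits into `abcd`, `defg`, and `ghij`: each chunk repeats the last character of the previous one.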
4. Metadata Generation
/src/services/documetsLoader/filesLoader.ts
// Per-chunk metadata
const generateChunkMetadata = (
  chunk: string,
  index: number,
  total: number,
  file: FileInput
): BaseQdrantMetadata<BaseQdrantMetadataType.DOCUMENT> => {
  return {
    source: file.originalname,
    type: BaseQdrantMetadataType.DOCUMENT,
    resource_uuid: file.resourceUuid,
    collection_name: file.collectionName,
    chunk_index: index,
    total_chunks: total,
    file_name: file.originalname,
    file_type: file.mimetype,
    chunk_size: chunk.length,
  }
}
Error Handling
Parsing Errors
/src/services/documetsLoader/filesLoader.ts
// Format-specific error handling
try {
  const content = await parseFile(file)
} catch (error) {
  // A caught value is not guaranteed to be an Error, so narrow it before reading .message
  const message = error instanceof Error ? error.message : String(error)
  if (error instanceof PDFParseError) {
    logError(ctx, 'PDF parsing failed', { error: message })
  } else if (error instanceof ExcelError) {
    logError(ctx, 'Excel parsing failed', { error: message })
  }
  throw new Error(`Failed to parse ${file.mimetype}: ${message}`)
}
Validation Errors
/src/services/tasksService.ts
// File size validation
const MAX_FILE_SIZE = 50 * 1024 * 1024 // 50 MB
if (file.size > MAX_FILE_SIZE) {
  throw new Error('File size exceeds maximum allowed size')
}
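Performance and Optimization
Streaming for Large Files
/src/services/documetsLoader/filesLoader.ts
// Streaming processing for large files
import fs from 'fs'
const processLargeFile = async (file: FileInput): Promise<LoaderResult[]> => {
  // Decode as UTF-8 in the stream so multi-byte characters are not split across reads
  const stream = fs.createReadStream(file.path, { encoding: 'utf8' })
  const chunks: string[] = []
  return new Promise((resolve, reject) => {
    stream.on('data', chunk => {
      chunks.push(chunk.toString())
    })
    stream.on('error', reject) // surface read failures instead of never settling
    stream.on('end', async () => {
      try {
        // Note: the whole text is still accumulated in memory before splitting
        const content = chunks.join('')
        const textChunks = await splitText(content)
        resolve(processChunks(textChunks, file))
      } catch (error) {
        reject(error)
      }
    })
  })
}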
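Reading a file through a stream only saves memory if the text is also split incrementally. The sketch below shows one way this might be done for plain-text input: a small stateful chunker that emits fixed-size chunks as text arrives and keeps at most about one window in its buffer. `createStreamingChunker` is a hypothetical helper that uses simple fixed-size windows rather than a separator-aware splitter; binary formats such as PDF, DOCX, or XLSX would still need the full buffer.

```typescript
// Hypothetical incremental chunker: emits fixed-size chunks with overlap
// as text arrives, without holding the whole document in memory.
const createStreamingChunker = (chunkSize = 1000, chunkOverlap = 200) => {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize')
  }
  const step = chunkSize - chunkOverlap
  let buffer = ''
  return {
    // Feed the next piece of decoded text; returns the chunks that are now complete.
    push(text: string): string[] {
      buffer += text
      const ready: string[] = []
      while (buffer.length > chunkSize) {
        ready.push(buffer.slice(0, chunkSize))
        buffer = buffer.slice(step) // keep the overlap for the next chunk
      }
      return ready
    },
    // Call once at end of input; returns the final (possibly shorter) chunk, if any.
    flush(): string[] {
      const rest = buffer
      buffer = ''
      return rest.length > 0 ? [rest] : []
    },
  }
}
```

With a Node.js stream opened as `fs.createReadStream(path, { encoding: 'utf8' })`, each `data` event's text could be passed to `push` and the returned chunks handed on immediately, with `flush` called on `end`.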
Metadata Caching
/src/services/documetsLoader/filesLoader.ts
// Cache for metadata of similar files
const metadataCache = new Map<string, FileMetadata>()
const getCachedMetadata = (fileHash: string): FileMetadata | null => {
  return metadataCache.get(fileHash) ?? null
}
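A plain `Map` grows for as long as the process runs. If that becomes a concern, one common option is to bound the cache and evict the least recently used entry. The sketch below illustrates the idea; `BoundedCache` is a hypothetical class, not part of the loader, and it relies on `Map` iterating keys in insertion order.

```typescript
// Hypothetical size-bounded cache with least-recently-used eviction.
class BoundedCache<V> {
  private readonly entries = new Map<string, V>()
  private readonly maxEntries: number

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries
  }

  get(key: string): V | null {
    if (!this.entries.has(key)) return null
    const value = this.entries.get(key) as V
    // Re-insert so this key becomes the most recently used
    this.entries.delete(key)
    this.entries.set(key, value)
    return value
  }

  set(key: string, value: V): void {
    this.entries.delete(key)
    this.entries.set(key, value)
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used
      const oldest = this.entries.keys().next().value as string
      this.entries.delete(oldest)
    }
  }
}
```

A cache keyed by file hash could then be created as `new BoundedCache<FileMetadata>(500)`, where 500 is an arbitrary illustrative limit.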
Configuration
File Limits
/src/services/tasksService.ts
// Limit configuration
const FILE_LIMITS = {
  maxSize: 50 * 1024 * 1024, // 50 MB
  maxChunks: 1000,
  chunkSize: 1000,
  chunkOverlap: 200,
}
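These limits interact: the chunk size and overlap determine how many chunks a text of a given length produces, and therefore when `maxChunks` is reached. The sketch below estimates the count under the assumption of fixed-size windows, where each window starts `chunkSize - chunkOverlap` characters after the previous one. `estimateChunkCount`, `exceedsChunkLimit`, and `LIMITS_SKETCH` are hypothetical names, and a separator-aware splitter will generally produce a somewhat different number.

```typescript
// Hypothetical copy of the chunk-related limits, for illustration only.
const LIMITS_SKETCH = { maxChunks: 1000, chunkSize: 1000, chunkOverlap: 200 }

// Estimates how many fixed-size windows are needed to cover textLength characters.
const estimateChunkCount = (textLength: number, chunkSize: number, chunkOverlap: number): number => {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize')
  }
  if (textLength <= 0) return 0
  if (textLength <= chunkSize) return 1
  const step = chunkSize - chunkOverlap
  // The first window covers chunkSize characters; each later one adds `step` new ones
  return Math.ceil((textLength - chunkOverlap) / step)
}

// True when the estimated chunk count is above the configured maximum.
const exceedsChunkLimit = (textLength: number, limits = LIMITS_SKETCH): boolean =>
  estimateChunkCount(textLength, limits.chunkSize, limits.chunkOverlap) > limits.maxChunks
```

Under these assumptions, 1000 chunks cover at most 1000 + 999 × 800 = 800,200 characters, so a plain-text file would reach `maxChunks` long before it reaches the 50 MB size limit.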
Supported File Types
/src/services/tasksService.ts
// Supported MIME types
const SUPPORTED_TYPES = [
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  'text/plain',
  'text/csv',
  'application/json',
  'application/xml',
]
Monitoring
Key Metrics
- Files Processed: Number of files processed, by type
- Parsing Success Rate: Percentage of files parsed successfully
- Average Processing Time: Mean processing time per file
- Chunk Generation: Number of chunks generated per file
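The four metrics above can all be derived from a few running totals per file type. The following is a minimal in-memory sketch of that bookkeeping; `createMetrics`, `TypeStats`, and the field names are hypothetical, and a real deployment would more likely export such counters to a monitoring system.

```typescript
// Hypothetical running totals kept for each file type.
interface TypeStats {
  processed: number
  succeeded: number
  totalMs: number
  totalChunks: number
}

// Hypothetical in-memory collector; names and fields are illustrative only.
const createMetrics = () => {
  const byType = new Map<string, TypeStats>()
  return {
    // Record the outcome of processing one file.
    record(fileType: string, ok: boolean, durationMs: number, chunks: number): void {
      const s = byType.get(fileType) ?? { processed: 0, succeeded: 0, totalMs: 0, totalChunks: 0 }
      s.processed += 1
      if (ok) s.succeeded += 1
      s.totalMs += durationMs
      s.totalChunks += chunks
      byType.set(fileType, s)
    },
    // Derive the metrics for one file type, or null if nothing was recorded.
    summary(fileType: string) {
      const s = byType.get(fileType)
      if (!s) return null
      return {
        filesProcessed: s.processed,
        successRate: s.succeeded / s.processed,
        avgProcessingMs: s.totalMs / s.processed,
        avgChunksPerFile: s.totalChunks / s.processed,
      }
    },
  }
}
```

Calling `record(file.mimetype, ok, elapsedMs, chunks.length)` once per file is enough to keep all four figures current.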
Detailed Logging
/src/utilities/loggerUtility.ts
// Log each stage
logInfo(ctx, 'File processing started', {
  fileName: file.originalname,
  fileType: file.mimetype,
  fileSize: file.size,
})
logInfo(ctx, 'File parsed successfully', {
  contentLength: content.length,
  chunksGenerated: chunks.length,
})