
Bulk loading files into Sedna XML DB - part 2

In part 1 of this article I used scripts to generate a single bulk-load file with LOAD instructions. That approach has several drawbacks: existing documents are not overwritten, and with a huge number of files it is hard to track the progress of a long-running operation. I've written a better script to solve those issues.

Bash script for loading files
The following Linux Bash script uploads the files one by one, using a separate LOAD instruction per file. Before each load it also tries to remove the document with a DROP DOCUMENT instruction, so existing documents are overwritten. After every 100 loaded files it prints a message with a timestamp, which helps to predict when the operation will finish.
#!/bin/bash

# This function writes a status message to both stdout and $OUTPUT_FILE
function print_status {
  echo ">>> Loaded $counter files, time: `date`" | tee -a $OUTPUT_FILE
}

OUTPUT_FILE=load_files.log
COLLECTION_NAME=legacyBasicTypes

echo "" > $OUTPUT_FILE

counter=0
print_status

for file in products/*
do
  # Strip the directory path to get the document name
  shortname=`echo $file | sed "s/.*\///"`

  # Drop the existing document (if any), then load the new version
  /appl/sedna/bin/se_term -query "DROP DOCUMENT '$shortname' IN COLLECTION '$COLLECTION_NAME'" mydatabase >> $OUTPUT_FILE
  /appl/sedna/bin/se_term -query "LOAD '$file' '$shortname' '$COLLECTION_NAME'" mydatabase >> $OUTPUT_FILE

  # Report progress after every 100 files
  let "counter = $counter + 1"
  if (( $counter % 100 == 0 )); then
    print_status
  fi
done

print_status
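
Before the first run, make sure the script is executable and the target collection exists in the database. A minimal sketch, assuming Sedna's CREATE COLLECTION statement (skip it if the collection is already there from part 1):

chmod +x load_files.sh
/appl/sedna/bin/se_term -query "CREATE COLLECTION 'legacyBasicTypes'" mydatabase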

Launching script and tracking progress
The following command launches the script in the background and prevents it from being terminated when the terminal is closed:
nohup ./load_files.sh &
You'll get output including the status and error messages that are written to the stdout and stderr streams (note that when the script is launched with nohup, this output typically ends up in the nohup.out file). Here is an example:
>>> Loaded 0 files, time: Fri Aug 24 15:52:23 CEST 2012

SEDNA Message: ERROR SE2006
No document with this name.
Details: 74ABT126D.xml

DROP DOCUMENT '74ABT126D.xml' IN COLLECTION 'legacyBasicTypes'>>> Loaded 100 files, time: Fri Aug 24 15:54:01 CEST 2012
>>> Loaded 200 files, time: Fri Aug 24 15:55:57 CEST 2012
>>> Loaded 300 files, time: Fri Aug 24 15:57:55 CEST 2012
>>> Loaded 400 files, time: Fri Aug 24 15:59:43 CEST 2012
>>> Loaded 500 files, time: Fri Aug 24 16:00:36 CEST 2012
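
If you come back later, for example from a new terminal session, a quick way to check whether the script is still running is to search for its process; a minimal sketch:

pgrep -f load_files.sh   # prints the PID(s) of the running script, or nothing if it has finished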
Finally, you can also watch the log file as it is being written using the following command:
tail -f load_files.log
This lets you see the result of each instruction. The following example shows the output of a successful DROP and a successful LOAD:
UPDATE is executed successfully
Bulk load succeeded
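
Once the run has finished, the same log file is handy for a quick sanity check. A small sketch that simply counts the messages shown above:

grep -c "Bulk load succeeded" load_files.log     # number of documents loaded successfully
grep -c "SEDNA Message: ERROR" load_files.log    # number of errors reported by se_term

Keep in mind that the SE2006 "No document with this name" errors are expected on the first run, because DROP DOCUMENT is issued even for documents that have not been loaded yet.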
