
Posts

Extracting XML comments with XQuery

I've just discovered that XQuery can process comment nodes. Ideally this should never be necessary: if you take part in designing your data formats, simply store valuable data in plain XML. But I have to deal with an OntoML data source that uses a somewhat peculiar format when exporting to XML: some data fields are stored inside XML comments. So here is an example of how to solve this problem.

XML example

This is a stub of a real XML document with irrelevant data omitted. Several thousand documents like this are stored in a Sedna XML DB collection. In the end, I need to extract a list of pairs for the complete collection: the identifier (e.g. SOT1209) and the saved timestamp (e.g. 2012-12-12 23:58:13.118 GMT).

    <?xml version="1.0" standalone="yes"?>
    <!--EXPORT_PROGRAM:=eptos-iso29002-10-Export-V10-->
    <!--File saved on: 2012-12-12 23:58:13.118 GMT-->
    <!--XML Schema used: V099-->
    <cat:catalogue xmlns:cat=...
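The post itself solves this inside the database with XQuery, where comment nodes are reachable through the `comment()` node test. As a rough illustration of what is being extracted, the sketch below pulls the saved timestamp out of a single exported file with plain `sed` text matching. This is a swapped-in technique, not the author's XQuery solution, and the helper name `saved_timestamp` is hypothetical.

```shell
#!/bin/sh
# Sketch only: plain-text matching with sed, NOT the XQuery approach from the post.
# saved_timestamp is a hypothetical helper name.
saved_timestamp() {
    # Reads XML on stdin and prints the text that follows "File saved on: "
    # inside a comment, up to the closing "-->".
    sed -n 's/.*<!--File saved on: \([^>]*\)-->.*/\1/p'
}
```

For example, `saved_timestamp < exported.xml` (file name assumed) would print the timestamp for one document; it does not recover the identifier, whose location is not visible in the truncated sample.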

IntelliJ IDEA Compiler Excludes issue with generated sources

I recently got a freshly licensed IntelliJ IDEA 12 and was very happy with it until I stumbled upon a strange issue. A Java project that had built successfully in previous versions of IDEA failed to build this time. In short, the solution was hidden in the IDEA Settings under Compiler.Excludes, where the directory with JAXB-generated sources had been excluded for some unknown reason. Below are the details and the screenshots.

Symptoms of issue

Whenever a sources directory is excluded from compilation, it is marked with cross signs, as the generated directory was in this case. This results in numerous "cannot find symbol" errors during compilation.

Settings Compiler.Excludes

The solution is on this settings page: just delete the entry for the excluded sources and they will magically reappear on the classpath.

Update - the root cause found

After a while I real...

Linux command line tips and tricks

This post lists a number of useful tips and tricks from my daily Linux experience. I mostly deal with RHEL, but I believe these commands are largely independent of the Linux distribution (or can be adapted).

Network commands

Basic net utils:

    # Who is listening on a port:
    netstat -lp | grep <port>
    # Show all connections with numeric addresses and process IDs:
    netstat -anp
    # Listen on a port (to check connectivity from the other side):
    netcat -l -p <port>
    # -or-
    nc -l -p <port>

SSH tunnel:

    # Tunnel to remote_ip:remote_port via proxy_ip with known login/password
    # The remote_ip:remote_port is redirected to localhost:local_port
    ssh -L local_port:remote_ip:remote_port login@proxy_ip
    # Real-world example of a tunnel to a remote Sedna XML DB:
    ssh -L 5050:134.27.100.67:5050 pxqa1@134.27.100.67

Download via HTTP proxy with wget:

    # Download a resource from the internet from behind a proxy:
    http_proxy=http://host:port ; export http_proxy ; w...
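To make the argument order of the SSH tunnel command easier to remember, here is a minimal sketch of a helper that only assembles and prints the `ssh -L` command line from its parts. The function name `tunnel_cmd` and its parameter order are hypothetical, chosen for illustration; nothing is executed.

```shell
#!/bin/sh
# tunnel_cmd is a hypothetical helper: it prints the ssh command, it does not run it.
tunnel_cmd() {
    local_port=$1    # port opened on localhost
    remote_ip=$2     # host reached through the tunnel
    remote_port=$3   # port on the remote host
    login=$4         # account on the proxy host
    proxy_ip=$5      # host that ssh connects to
    echo "ssh -L ${local_port}:${remote_ip}:${remote_port} ${login}@${proxy_ip}"
}
```

Calling `tunnel_cmd 5050 134.27.100.67 5050 pxqa1 134.27.100.67` reproduces the real-world Sedna example from the post.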

Extracting collection from Sedna XML DB

This post is actually based on something of an epic fail story. Initially the task was just to rename a collection in Sedna XML DB. The solution is as simple as using the RENAME COLLECTION statement of the Sedna Data Definition Language. But I'm probably too enthusiastic about writing Bash scripts in Linux, so I missed the single-statement solution and wrote a bunch of scripts to perform the same task via an extract-and-load procedure. Still, the scripts can be quite valuable for more complex tasks such as moving a collection between XML DB installations (e.g. from the Production to the Test environment) or merging collections. My solution follows below.

Extracting a single file

It's always wise to modularize the code and divide a task into smaller parts. First, we need a script for extracting a single file. It needs to be parameterized with a file name and a collection name. I also address another essential problem here: the safety of file names. It's not a common problem but we do...
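The excerpt is cut off before it shows how file names are made safe, so the following is only one possible approach, not the author's script: replace every character outside a conservative whitelist with an underscore before using a document name as a file name. The helper name `safe_name` and the whitelist are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch, not the script from the post.
# Keeps letters, digits, dot, underscore and hyphen; replaces everything else with "_".
safe_name() {
    # printf avoids the trailing newline that echo would add (tr would turn it into "_").
    printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}
```

With this whitelist, a document named `my doc/1:2.xml` would be written to `my_doc_1_2.xml`. Note that two different document names can collapse to the same file name, which a real script would need to handle.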

Using JavaScript hashCode to enable Cocoon caching of POST requests

I've just run into an issue with Cocoon caching of POST requests. Let me describe the use case. We use a custom XQueryGenerator to execute XQuery code against a Sedna XML Database and then process the XML results in the Cocoon pipeline. For the sake of performance, I configured pipeline caching with an expiration timeout of 60 seconds for all XQuery invocations:

    <map:pipeline id="cached-services" type="expires" internal-only="true">
      <map:parameter name="cache-expires" value="60"/>
      <map:parameter name="cache-key" value="{request:sitemapURI}?{request:queryString}"/>
      <map:match pattern="cached-internal-xquery/**">
        <map:generate src="cocoon:/xquery-macro/{1}" type="queryStringXquery">
          <map:parameter name="contextPath" value="{request:contextPath}"/>
        </map:generate> ...

Bulk loading files into Sedna XML DB - part 2

In part 1 of this article I used scripts to generate a bulk load file with LOAD instructions. But that approach has several drawbacks: existing files are not overwritten, and it is hard to track the progress of a long-running operation when there is a huge number of files. I've written a better script to solve those issues.

Bash script for loading files

The following Linux Bash script uploads files one by one using separate LOAD instructions. It also tries to remove each file first using a DROP DOCUMENT instruction, so existing files are overwritten. After every 100 files loaded, you get a message with a timestamp, which helps to predict when the operation will finish.

    #!/bin/bash
    # This function writes a status message to both stdout and $OUTPUT_FILE
    function print_status {
        echo ">>> Loaded $counter files, time: $(date)" | tee -a "$OUTPUT_FILE"
    }
    OUTPUT_FILE=load_files.log
    COLLECTION_NAME=legacyBasicTypes
    echo "" > "$OUTPUT_FILE"
    counter=0...
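Since the script above is truncated, here is a minimal dry-run sketch of the drop-then-load loop with a status line every 100 files. It is not the author's script: `run_statement` is a hypothetical hook that only prints each statement, `load_all` and `STATUS_EVERY` are made-up names, and the argument form of the DROP DOCUMENT and LOAD statements reflects my understanding of Sedna's syntax and should be checked against the Sedna documentation.

```shell
#!/bin/bash
# Dry-run sketch: statements are printed, nothing is sent to Sedna.
COLLECTION_NAME=legacyBasicTypes   # collection name taken from the post
STATUS_EVERY=100                   # report progress after this many files

run_statement() {
    # Hypothetical hook: replace this echo with your real Sedna terminal call.
    echo "$1"
}

load_all() {
    local counter=0 file name
    for file in "$@"; do
        name=$(basename "$file")
        # Drop first so that an existing document is overwritten by the load.
        # (Assumed statement forms; verify against the Sedna documentation.)
        run_statement "DROP DOCUMENT \"$name\" IN COLLECTION \"$COLLECTION_NAME\""
        run_statement "LOAD \"$file\" \"$name\" \"$COLLECTION_NAME\""
        counter=$((counter + 1))
        if [ $((counter % STATUS_EVERY)) -eq 0 ]; then
            echo ">>> Loaded $counter files, time: $(date)"
        fi
    done
    echo ">>> Done: $counter files"
}
```

A call such as `load_all /path/to/xml/*.xml` (path assumed) prints the statement pairs in order. The drop is expected to fail harmlessly for documents that do not exist yet, so a real hook should not abort on that error.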

Bulk loading files into Sedna XML DB

The problem is to upload a large number of files into Sedna XML DB. How would you do this? If it is a repeated action, it's logical to create an application for it. This is quite easy using the Sedna XML:DB Java API. Actually we've already done so, but this article addresses another case. The problem with the Java API is performance: it always brings overhead compared to Sedna's own terminal utility (I got 2 seconds per file with a remote Sedna installation). Now I have several thousand files and I want to upload them fast, so let's write some useful scripts to automate it.

Generate bulk load file

First we need to generate an XQuery file with the LOAD instructions supported by the Sedna terminal utility. Let's do this with another simple script. I had to do this under both Linux and Windows, so you'll find two scripts below. First comes the Linux shell script:

    #!/bin/sh
    OUTPUT_FILE=bulk_load.xque...
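The original script is cut off after its first line, so the following is only a sketch of the general idea under stated assumptions: emit one LOAD statement per XML file in a directory. The function name `gen_bulk_load`, its parameters, and the use of the file's base name as the document name are all hypothetical. How statements must be separated depends on how the file is fed to the Sedna terminal, so the separator is left as an optional argument rather than guessed.

```shell
#!/bin/sh
# Hypothetical sketch, not the script from the post.
# Usage: gen_bulk_load DIR COLLECTION [SEPARATOR] > bulk_load_file
gen_bulk_load() {
    dir=$1
    collection=$2
    sep=${3:-}   # statement separator; empty by default (assumption: set it as Sedna requires)
    for f in "$dir"/*.xml; do
        [ -e "$f" ] || continue          # skip the literal pattern if no file matched
        name=$(basename "$f")            # document name = file name (assumption)
        printf 'LOAD "%s" "%s" "%s"%s\n' "$f" "$name" "$collection" "$sep"
    done
}
```

Redirecting the output to a file gives a list of statements that can be reviewed before anything is loaded, which is useful when several thousand files are involved.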