Format String Vulnerability in sudo
If you write a program which is run by (or as) one user and is triggered by another (e.g., one which is SUID, or one which accepts a TCP connection, or one which a browser runs when reading web pages of a particular form, etc.), you need to consider security particularly seriously. If your program loses control, it usually means your program has a security hole.
It is particularly easy to lose control in C (and C++), so be sure you understand secure programming if you write in those languages. One easy way to lose control is to use a variable format string in the printf/scanf family of functions. Recently a security hole was found in the "sudo" program, allowing normal users to escalate to root. Read the following LWN article to fully understand the problem.
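The same class of mistake exists outside C. The sketch below is only an analogue in Python, not the sudo bug and not printf: it assumes a hypothetical Account class and two hypothetical helper functions, and shows what happens when text supplied by another user is used as the format string itself rather than as data.

```python
class Account:
    """Hypothetical object that happens to be passed to the formatter."""

    def __init__(self, user, api_key):
        self.user = user
        self.api_key = api_key  # must never reach a log or a reply


def greet_unsafe(template, account):
    # BAD: the caller-supplied text is itself the format string,
    # so it can reach into any attribute of the objects passed in.
    return template.format(account=account)


def greet_safe(text, account):
    # GOOD: the format string is a constant; caller text is only data.
    return "{}: {}".format(account.user, text)


acct = Account("alice", "k-12345")
attack = "hello {account.api_key}"

print(greet_unsafe(attack, acct))  # the secret is substituted into the output
print(greet_safe(attack, acct))    # the braces are printed literally
```

The rule carries over to C directly: keep the format string constant (for example printf("%s", text) rather than printf(text)).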
Virtual networking in Linux
When we set up a virtual machine (VM) for daily use, we are often relieved by today's near-native efficiency in CPU and memory usage. However, one aspect we may overlook is networking. We may get inferior efficiency if we do not configure the hypervisor/VM to use the most efficient software/hardware I/O virtualization feature available on our platform.
The following is a general introduction to virtual networking
On virtualization platforms like KVM and VirtualBox with a Linux (RHEL/CentOS 5.5) guest, I often select virtio rather than emulated hardware for better efficiency.
Effective number of cores
Memory bandwidth (between CPU cores and main memory) limitation can often be a bottleneck in multi-processor, multi-core SMP machines. In
http://www.linux-mag.com/id/7855?hq_e=el&hq_m=1069497&hq_l=4&hq_v=7db03da31d
Douglas Eadline (Senior HPC Editor for Linux Magazine) defines a simple and practical measure "number of effective cores". I think his thought is very sensible. Read the article to see his results.
Avoid Memory Leaks in POSIX Thread Programming
The article
- explains the difference between joinable and detached threads.
- shows that each thread reserves 10MB of memory for its stack on a common Linux system, so the maximum number of threads (not yet recycled) is about 300, consuming 3GB of memory.
- teaches us to use pmap and /proc/PID/task to monitor thread usage.
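The "about 300" figure is simple arithmetic on the two numbers quoted above. A minimal sketch, where the function name is made up for illustration and the 10MB and 3GB values are taken from the list rather than measured:

```python
STACK_MB = 10         # per-thread stack reservation quoted above
ADDRESS_SPACE_GB = 3  # memory figure quoted above


def max_unrecycled_threads(space_gb, stack_mb):
    """How many thread stacks of stack_mb fit into space_gb."""
    return (space_gb * 1024) // stack_mb


print(max_unrecycled_threads(ADDRESS_SPACE_GB, STACK_MB))
```

Halving the stack size doubles the limit, which is why the stack size setting matters for programs that create many threads.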
I just learnt from kkto that, more precisely, what is leaked is the process address space. That is, some memory space is reserved for each thread. Apart from being used by the process to refer to the memory, the reserved address also serves as the thread identifier. The memory itself is freed (if you run the demo program, you'll notice that the memory consumption shown by the top command is very small), but because the addresses are used to identify the threads, and because somebody might still query the status of those threads, their addresses cannot be reused to identify other threads.
In C++ programming, we often prefer to use Boost.Thread rather than using pthread directly, because of its portability (to Windows) and higher level of abstraction. We should make sure we do not create the leaks in the underlying pthread and we may use the tricks the author introduced to check.
Improving C++ Compilation Speed
From time to time, we get impatient waiting for a C++ software build to complete. In addition to the reasons given in the article
I/O (writing .o files to disk, reading them back and writing the executable / library file during linking) usually takes a significant portion of the time in the build process. Avoid using a network file system for, say, the build directory, because the many .o files are not what you ultimately want. You can copy/install the resulting executable or library file to a safe place afterwards anyway. Another tip is to set up a RAM disk for the build directory. You will need a server with enough RAM, but you will definitely see a speedup.
Beware of the size of the binary files, because a lot of symbols may be generated by template instantiation. If the binary is extraordinarily large, use "nm --demangle" to inspect the symbols and look for the source in your program that leads to the generation of so many symbols (this can be tricky). You may need to think of tricks to get around it. A large memory and disk space footprint is not good for your software in any case.
Using Lucene on a Hadoop Distributed File System
I came across an article which Clustertech colleagues may find interesting
It describes how to access data files in a Hadoop Distributed File System (HDFS), build an index using Lucene and write the index files back to HDFS, set up a RAMDirectory under HDFS (a RAM drive on each HDFS node), and search with index files in HDFS.
If I/O workload bottlenecks a Lucene search system, it may be worth trying to put the data in HDFS so as to distribute the I/O workload.
jQuery and Blocking the User Interface
I heard that some of our colleagues use jQuery and AJAX for programming Web interfaces. The following article introduces a technique that lets part of the web interface block (perhaps displaying "Work in progress") until the backend server finishes its job. Although I have not done any web programming at all, I guess it may be helpful to you. So here you are:
Blog of Python author
While many of us use Python in our daily work, few of us can exploit the power features provided by the language, like:
- Meta-classes
- New-style classes
- Decorators
- Descriptors
- Generators
To change that, we have two choices: the manual and the tutorials. Unfortunately, the manual is too detailed for beginners to digest, and most tutorials are so elementary that they leave readers wondering "why is that useful" or "why such a design". Here's a third choice: the original author and final decision maker of Python wrote a few articles about how some of the interesting aspects of Python were designed. Enjoy.
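To give a taste of why these features are useful, here is a small sketch of three of them (a decorator, a descriptor and a generator). All names are made up for illustration, and the code assumes Python 3.6 or later, where every class is already a "new-style" class.

```python
import functools


def logged(func):
    """Decorator: remember the arguments of every call in wrapper.calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls.append(args)
        return func(*args, **kwargs)
    wrapper.calls = []
    return wrapper


class Positive:
    """Descriptor: reject non-positive values on assignment."""

    def __set_name__(self, owner, name):
        self.name = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError("must be positive")
        setattr(obj, self.name, value)


class Job:
    cores = Positive()  # validation lives in one place, reused per attribute

    def __init__(self, cores):
        self.cores = cores


def countdown(n):
    """Generator: produce n, n-1, ..., 1 lazily, one value per request."""
    while n > 0:
        yield n
        n -= 1


@logged
def square(x):
    return x * x


print(square(3), square.calls)
print(list(countdown(3)))
print(Job(4).cores)
```

The decorator adds behaviour without touching square itself, the descriptor keeps validation out of every setter, and the generator never builds the whole sequence in memory.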
Searching with Solr, Part 2: From Basic Full-Text Search to Spatial Search
The previous installment [1] introduces Lucene/Solr, the open source search technology. If you haven't had the chance to read it and you are interested in it, please visit this link. The article also comes with a description of a Solr demo. It shows you how to provide an indexing and searching service using Solr for an existing Intranet without modifying a single line of the Intranet's code. This sets the stage for building a preliminary infrastructure for a search facility in an enterprise Intranet.
The second installment is the continuation of the previous installment on Lucene/Solr. Its main goal is to provide enough foundation to understand how to perform spatial search as described in [2]. It will dive into the details of the "Search" capability offered by Lucene/Solr.
Colleagues who are working with Lucene/Solr are encouraged to go through the article. Hopefully, through collective learning, we can build our expertise in the Search Technology, improve the software quality and the communication between teams.
A Recap for Lucene/Solr
If you have read the previous installment and gone through the demo, you should realize by now that, in general, a search application is composed of 3 components, namely 1) the data collection (e.g. the web crawler), 2) the full-text index and 3) the search engine. In fact, the components can be architected independently of each other as long as each component respects the interface requirements of the others. This improves the modularity and the scalability of the search application. Solr provides the 2nd and the 3rd components of a search application. If the consumer of the search results is a human being, the results will be rendered by another application (e.g. Drupal) before being displayed to the actual user.
Basic Search with Solr
Let's start by understanding how to make a search request to Solr, dissecting the URL used in the previous demo. To make a search request, you simply submit it to Solr using HTTP GET. Essentially, you provide a URL that specifies all the search parameters in its query string. For example, in http://tech.clustertech.com:8983/solr/select?q=jerry&q.op=AND&qt=standard&wt=standard (internal link), the first part, http://tech.clustertech.com:8983/solr (internal link), locates an instance of a Solr core. After that, you need to specify a request handler for your search request. This is done with /select?qt=standard, which tells Solr to use the default request handler for this request (note that qt stands for query type/query handler). The value of the qt parameter must match one of the query handlers defined in solrconfig.xml. Currently, only the default query handler is defined in solrconfig.xml. The other query parameters, q, q.op and wt, denote respectively the user's query (one or more search terms, in this case jerry), the boolean operator applied between the search terms (in this case AND) and the query response writer (i.e. the output format; in this case the standard format, which is XML by default). Therefore, if you want to search for my contact (Jerry's contact), you can replace the value of the q parameter with "jerry contact" in the URL. If you want to change the output format to JSON, for example, you simply replace the wt value with json. Just for fun, you can try other formats such as python, ruby and java binary. For java binary, you are encouraged to use the solrj client.
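As a minimal sketch of the dissection above, the query string can be assembled from its parts with Python's standard library. The helper name build_select_url is made up for illustration; the host and the four parameters are the ones from the example URL.

```python
from urllib.parse import urlencode

# Internal Solr instance quoted in the text above.
SOLR_BASE = "http://tech.clustertech.com:8983/solr"


def build_select_url(terms, op="AND", handler="standard", writer="standard"):
    """Build a /select URL from the q, q.op, qt and wt parameters."""
    params = [("q", terms), ("q.op", op), ("qt", handler), ("wt", writer)]
    return SOLR_BASE + "/select?" + urlencode(params)


print(build_select_url("jerry"))
print(build_select_url("jerry contact", writer="json"))
```

The first call reproduces the demo URL; the second shows the two changes described in the text (a second search term and JSON output), with the space between the terms encoded as a plus sign.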
There are many other parameters you can specify in the URL to alter the search results, but the most useful one for me is debugQuery. With it, the response contains a lot of information about the parsed query string to help you during debugging. For instance, it tells you how the score of each result is computed, which helps you fine-tune the search precision and diagnose search recall problems.
Many other search features are built into Solr; I will only go over the few that are relevant to spatial search. First of all, you can influence the scoring mechanism by attaching a multiplier (a boost) to a clause. For instance, you can specify q=content:jerry ho lam^0.1, which means you search for jerry in the content field together with either ho or lam, but you are less interested in results that match lam (i.e. those named jerry lam). Sorting with Solr is easy. You only need to specify "sort" as one of the query parameters and provide the field you would like to sort by and the order. For example, sort=score+desc sorts the results by score in descending order, whereas sort=score+asc sorts by score in ascending order.
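The plus sign in sort=score+desc is just a URL-encoded space. A small sketch using only Python's standard library shows how the boosted query and the sort parameter from the paragraph above look once encoded, and that they decode back to what was typed:

```python
from urllib.parse import parse_qs, urlencode

params = [
    ("q", "content:jerry ho lam^0.1"),  # boosted query from the text
    ("sort", "score desc"),             # field name, a space, then the order
]

query_string = urlencode(params)
print(query_string)

# Decoding the query string gives back the original values.
decoded = parse_qs(query_string)
print(decoded["q"][0])
print(decoded["sort"][0])
```

Characters such as ":" and "^" are percent-encoded by urlencode, so it is safer to let a library build the query string than to paste the raw query into the URL by hand.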
The XML response returned by Solr after a search request can be daunting for a human to interpret without a little explanation. I recommend using a browser with built-in support for formatting XML documents, such as Chrome. The XML response contains 2 main sections (with debugQuery disabled): responseHeader and response. responseHeader is the response header with some metadata embedded in its child nodes. For example, status specifies the search status; 0 means no problem and any nonzero value means there is a problem (it won't tell you what the problem is). QTime is the time Solr needed to process the request, in milliseconds. The response section contains all the results matching the search query. Every element (i.e. <doc>) within the response element represents a document in the index. As an exercise, you are encouraged to explore the other XML elements within the <doc> element. This is the information that Solr returns for other applications to consume.
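For an application consuming the response, the two sections can be read with a few lines of code. The sketch below uses Python's standard XML parser on a hand-written sample in the shape described above; the document fields and their values are invented for illustration and are not output from the demo server.

```python
import xml.etree.ElementTree as ET

# Hand-written sample response; the field values are made up.
SAMPLE = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="title">Jerry contact</str>
      <str name="url">http://wiki.example.invalid/jerry</str>
    </doc>
  </result>
</response>"""


def summarize(xml_text):
    """Return (status, qtime_ms, list of documents as dicts)."""
    root = ET.fromstring(xml_text)
    header = root.find("lst[@name='responseHeader']")
    status = int(header.find("int[@name='status']").text)
    qtime = int(header.find("int[@name='QTime']").text)
    docs = []
    for doc in root.findall("result[@name='response']/doc"):
        # Each child of <doc> is one stored field, keyed by its name attribute.
        docs.append({field.get("name"): field.text for field in doc})
    return status, qtime, docs


status, qtime, docs = summarize(SAMPLE)
print(status, qtime, docs)
```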
Spatial Search with Solr
Now you are equipped with the basic knowledge to understand how the spatial search described in [2] works. Spatial search uses two advanced search capabilities of Solr, namely "Range Queries" and "Function Queries", based on spatial information in the index. "Range Queries" limit the search results by numeric, date and text (with natural sorted ordering) ranges. Range queries can be specified in the fq value in the URL as follows: fq=lon:[-80 TO -78]&fq=lat:[31 TO 33], which creates a bounding box for the search using the specified latitude and longitude ranges. "Function Queries" influence Lucene's scoring algorithm by allowing users to add a mathematical expression involving indexed field values on top of the relevancy score. There are a lot of built-in mathematical functions in Solr, such as sum(x,y) to add x and y, sub(x,y) to subtract y from x, etc. In the spatial search article, one of the functions used is the dist function, which lets you boost and sort documents by distance. The dist function calculates the distance between two vectors; its first argument selects the kind of distance (e.g. 1 for Manhattan, 2 for Euclidean). Again, you don't need to program it. You only need to specify the function you want to apply in the q value of the URL as follows:
http://localhost:8983/solr/select/?q=name:Minneapolis AND _val_:"recip(dist(2, lat, lon, 44.794, -93.2696), 1, 1, 0)"^100
As you can see from the example above, you can nest functions in function queries, here by applying the recip function (a reciprocal function) to the dist function. With "Range Queries" and "Function Queries", you can search for information that is within your vicinity. For example, in http://www.roadroadguide.com/, you can provide a list of the transportation locations that are closest to the users, so that they can filter out those that are too far away. In addition to the cost-based optimization for transportation, you can also add a time-based optimization on top of the search algorithm for the available choices of transportation. Later on, you might find that the distance functions provided by Solr cannot satisfy your needs. You can then create your own Solr plugin for your custom functions by implementing org.apache.solr.search.ValueSourceParser [3] and registering your plugin in solrconfig.xml under the <config> tag.
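To make the numbers concrete, here is a plain Python sketch of what the range filter and the nested recip(dist(...)) expression compute. These are re-implementations written for illustration, not Solr code: recip(x, m, a, b) is taken to be a / (m*x + b), and dist treats latitude and longitude as plain coordinates rather than computing a great-circle distance.

```python
def dist(power, *coords):
    """Distance between two points passed as x1, y1, ..., x2, y2, ...

    power=1 gives the Manhattan distance, power=2 the Euclidean distance.
    """
    half = len(coords) // 2
    first, second = coords[:half], coords[half:]
    total = sum(abs(p - q) ** power for p, q in zip(first, second))
    return total ** (1.0 / power)


def recip(x, m, a, b):
    """Reciprocal function: a / (m*x + b)."""
    return a / (m * x + b)


def in_box(lat, lon, lat_range, lon_range):
    """Equivalent of fq=lat:[lo TO hi]&fq=lon:[lo TO hi]."""
    return (lat_range[0] <= lat <= lat_range[1]
            and lon_range[0] <= lon <= lon_range[1])


print(dist(2, 0, 0, 3, 4))            # Euclidean
print(dist(1, 0, 0, 3, 4))            # Manhattan
print(recip(dist(2, 0, 0, 3, 4), 1, 1, 0))
print(in_box(32, -79, (31, 33), (-80, -78)))
```

With m=1, a=1 and b=0 as in the example URL, the boost is simply 1/distance, so closer documents score higher. A document at distance zero would divide by zero in this sketch, which is one reason to pick a nonzero b.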
Summary
As you can see, using Solr, you can alter the search results without programming. For advanced scenarios, you can develop plugins for Solr to enhance the search quality without worrying about other components in the search engine pipeline.
References
- Searching with Solr Part 1: Introducing Lucene and Solr with a Demo : http://blog.clustertech.com/trac/public/blog/searching-solr-1
- Location-aware search with Apache Lucene and Solr: http://www.ibm.com/developerworks/java/library/j-spatial/index.html
- ValueSourceParser JavaDoc: http://lucene.apache.org/solr/api/org/apache/solr/search/ValueSourceParser.html
Searching with Solr, Part 1: Introducing Lucene and Solr with a Demo
Since I know some colleagues (particularly the ES team and the ACE team) are experimenting with Solr for one of their potential projects, I would like to take this opportunity to share my experience with Solr in this article. Of course, other colleagues who are eager to learn new things might also benefit from it. If you want to look at a demo of Solr first, please skip the first section and go ahead to the second section of this article, which provides a description of the Solr demo.
*I would like to thank Eric and Danny for allowing me to make use of the CT wiki in the demo program.*
Lucene/Solr Introduction
It is difficult to introduce Solr without first describing what Lucene is about. Lucene is a powerful and high-performance information retrieval library [1]. Information retrieval (IR) refers to the process of searching for documents, information within documents or metadata about documents. Lucene is an open source (Apache Software License) project, and it is a mature, popular and battle-tested IR library. So the first question that many people new to Lucene will ask is: "Can I have Lucene search the files on my hard drive?" or "Can I use Lucene as my web site search engine?" The answer is "no". Lucene is not a full-featured search application. Its main focus is "text indexing" and "text searching". Lucene can index and make searchable any data that you can extract text from. The key point here is "text". Lucene deals with text only. It doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it. This means you can index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, XML or HTML or PDF files, or any other format from which you can extract "textual information". Therefore, you can build your application on top of Lucene to add search capability with business rules specific to its problem domain (e.g. MS Windows Desktop File Search Engine). A number of full-featured search applications have been built on top of Lucene, and one of them is called Solr.
Solr is an open source (Apache Software License) enterprise search server built on top of Lucene [2]. Similar in spirit to the Google Search Appliance, Solr can serve as the central indexing and searching service in an enterprise environment. Its main focus is on searching heterogeneous content on a website, an intranet, or a content management system. Crawlers are provided by third parties with Solr integrations (e.g. Tika, Nutch, Lucene Connectors Framework, etc). Solr runs in a separate process or on a separate computer, and is accessible using a standard network-based protocol. Since it is meant to be accessible by many different kinds of systems in an enterprise environment, it implements an XML-over-HTTP API (a.k.a. RESTful API). Solr also has client APIs in multiple programming languages for interacting with Solr over the network. For a list of supported clients, please visit http://wiki.apache.org/solr/IntegratingSolr. If you can't find a client for your specific needs, you can develop your own. Finally, since Solr is built on top of Lucene, it has all the functionality of Lucene and provides additional functionality such as distributed search, faceted navigation, caching and replication, just to name a few. It also takes care of a lot of the administration through a Web UI, such as logging control, cache utilization statistics, query statistics, and a text analyzer debugger which shows the results of every stage in an analyzer. For a more complete list of features, please visit http://lucene.apache.org/solr/features.html.
The next question might be when to use Lucene and when to use Solr. This is a tough question, and a few basic tips might help you determine which one is best for your specific need. Since Solr is built on top of Lucene, both have similar flexibility and performance characteristics in terms of indexing and searching. The first question might be "How much (in time and money) will this solution cost me if I use Lucene/Solr?" Solr provides additional features beyond Lucene. If the solution needs those features, then using Solr will save you time. The second question might be "Do you have access to the source code of the application for which you want to build the search functionality?" Using Lucene, you would want to integrate the IR library into the application. Without the source code of the application, developers might spend a lot of time writing infrastructure code wrapped around Lucene to make indexing and searching work for the application. Lastly, if the idea is to have a central indexing and searching service in an enterprise environment, Solr should be more appropriate than Lucene.
Experiments
Now I'm going to demonstrate Solr using a fictitious application that I developed. The goal of the fictitious application is to provide search functionality for the CT Intranet (i.e. http://wiki.clustertech.com) using Solr. The goal of the demo is not to suggest best practices for Solr, nor is it the best solution for the problem. It is mainly for demonstrating one of the possible ways to provide a search engine for an Intranet.
Currently, I can't find my contact using the search functionality provided by the CT wiki. Typing "jerry contact" in the search box of the CT wiki will not return the page with the contact information. I'm not sure what exactly the problem is, but let's assume that the search functionality does not work quite well in the CT wiki. I would like to provide a search service on the Intranet that people can use to search the CT wiki and that allows me to customize the search easily (at least I should be able to find my own contact). To provide a search service for the CT wiki, I have 3 components: an intranet crawler, a search server and a web UI for displaying search results. In this case, I use Nutch as the intranet crawler, Solr as the search server and Drupal as the web UI. Note that you can choose whatever crawler or web UI you like. In my case, this choice makes my job easier, as you will see later. The workflow is as follows:
- Nutch is used to crawl the CT wiki information.
- The crawled information is then indexed by Solr for searching later.
- Drupal provides a UI for people to search CT wiki contents by querying Solr's index.
Some basic design considerations for the fictitious application are that:
- Nutch, Solr and Drupal each run in their own process (i.e. they can run independently on 3 separate computers). Therefore, their capacity can be scaled independently.
- No change is required to the CT wiki source code or the wiki configuration.
Since both Drupal and Nutch have integration APIs for Solr [3][4], the implementation is quite straightforward. The basic setup requires only installing the software involved and a few small tweaks to make the pieces work together as a whole. For now, the basic setup is sufficient for demonstration purposes. To perform a search in Drupal, you can go to http://tech.clustertech.com/drupal (internal link) from any web browser (only tested on Chrome). Just type the search keywords in the search box and press enter; it will search the index stored in Solr. Results are then displayed in the UI.
For someone who wants to consume the query result directly from Solr (e.g. a system that sits between the human user and Solr and needs to convert the query result from one format to another for display), you can try http://tech.clustertech.com:8983/solr/select?q=Search_Terms&q.op=AND&qt=standard&wt=standard (more information on the http query can be found at http://lucene.apache.org/solr/tutorial.html#Querying+Data). The Search_Terms can be any string (e.g. jerry lam contact). q.op specifies whether all of the search terms (AND) or just one of them (OR) need to match. wt refers to the output format (e.g. the default is xml, which is also aliased to standard). qt refers to the query handler used for searching (in this case, we use standard, which is defined in the Solr configuration file). After parsing and processing the user query, Solr will return an XML document which contains all the results found in the index. You can try out other query parameters if you like ( http://lucene.apache.org/solr/tutorial.html#Querying+Data). Note that I didn't enhance the search algorithm, nor did I make it search Chinese explicitly, but it is possible to tune the search precision and recall to suit the application's needs.
What can we learn from this experiment? First, it shows that most of the work in this demo is integration. For instance, in order to crawl the Intranet we need to find a suitable data connector (a.k.a. spider). Different data sources need different data connectors, which might or might not be available. If one is not available in the market, we might need to build it ourselves. Second, the demo shows that we need to map the source data to Lucene's structure. For instance, there is a schema.xml in Solr to map the source data to the following structure: host, site, url, content, title, etc. In the schema, you need to define the field types and the fields of those types that store the data. Third, once the full-text index is built, we need to construct the query URL so that Solr will return a response containing the search results. The query URL specifies the fields (corresponding to the fields defined in the schema) that are to be returned in the response, the way the results should be sorted and the output format, a.k.a. the writer type (e.g. xml, json, java binary, python, etc). As you can see, the response is very application specific. In my case, I'm using json as the output format (as required by the Drupal Solr integration), with the returned fields and query parameters chosen to meet the requirements of Drupal's presentation layer.
What I didn't show in this demo that might be of interest to some ACE and BIC colleagues are as follows: the text analysis (i.e. tokenization, stemming, synonyms, etc that are used to process input text for a field during indexing and searching) and scoring (i.e. the measure of the similarity between a query and each document that matches the query). As they already know from Lucene, both of them can be tuned to suit the application of interest. For ES colleagues, they might be more interested in the deployment and the optimization of Solr to meet the application's SLA and the query throughput requirement in a production environment.
Finally, I want to emphasize how important it is to understand the requirements. Once we have a good understanding of the requirements, we can estimate the complexity of the project and the time required to deliver it with the available resources. Knowing which systems are involved (in this demo, the Intranet and Drupal) and having some basic user stories are usually good enough to estimate the project complexity. For this demo, I spent 2 days building it, from the requirements to the design and then to the implementation. For more complex scenarios, it can take months or years.
I hope you enjoy it.
References
- Lucene: http://lucene.apache.org/java/docs/
- Solr: http://lucene.apache.org/solr/
- Solr Nutch integration: http://wiki.apache.org/nutch/RunningNutchAndSolr
- Apache Solr Search Integration with Drupal: http://drupal.org/project/apachesolr
Kernel Shared Memory (KSM): Avoid keeping many copies of the same page in RAM
Every tech company is turning physical machines into virtual machines, especially if their CPUs are not busy. Here, RAM is nearly always the bottleneck: a server with 16G RAM can hold only 15 VMs of 1G (virtual) RAM. Or is it really the case?
Take our alpha01 as an example, which hosts 9 VMs for us. It has 16G RAM, and the virtual RAM of the VMs there sums to 12G. But the actual amount of RAM used on it is just 9.15G. In other words, nearly 3G is saved somewhere, a saving of 24%.
How? Many VMs there run exactly the same OS and mostly the same software. So the content of many memory pages is exactly the same. These pages can be merged into one page, and with Copy-On-Write (COW), the VMs don't know about it and thus run fine. But how can the host actually find the pages with the same contents, merge them and apply COW? The difficulty is that the different VMs are all running different images, so nothing tells which pages are the same.
The answer is KSM, a kernel daemon actively scanning the memory contents of the VMs and merging them whenever there is a good opportunity. Here is an article targeting system administrators and developers, giving a very good overview of how this is achieved.
The best bit? The interface for asking the kernel to do that is actually quite simple and is available to all application programs, not just VM implementations. So if you have your own application which can benefit from such memory optimisations, a few calls to madvise, a configuration to your kernel, and you're set.
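As a rough sketch of what "a few calls to madvise" can look like from an application, the function below marks an anonymous private memory region as mergeable using Python's mmap module. It assumes Python 3.8 or later on Linux; the function name is made up, and on systems without KSM support it simply reports False. The kernel-side configuration I have in mind is turning the KSM daemon on (on kernels that expose it, by writing 1 to /sys/kernel/mm/ksm/run as root), which this sketch does not do.

```python
import mmap


def mark_mergeable(num_pages=16):
    """Try to advise the kernel that a private anonymous region may be merged.

    Returns True if the kernel accepted MADV_MERGEABLE, False otherwise
    (not Linux, older Python, or a kernel built without KSM).
    """
    if not hasattr(mmap, "MADV_MERGEABLE"):
        return False
    # KSM only considers private anonymous pages, so do not use the
    # default shared mapping here.
    buf = mmap.mmap(-1, num_pages * mmap.PAGESIZE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf.madvise(mmap.MADV_MERGEABLE)  # applies to the whole region
        return True
    except OSError:
        return False
    finally:
        buf.close()


print(mark_mergeable())
```

A real application would keep the region alive and fill it with its data; the region is closed immediately here only to keep the sketch short.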
Here's the article. Enjoy. http://www.ibm.com/developerworks/linux/library/l-kernel-shared-memory/index.html
Memory management thread speeds up programs by 20%
Researchers found that malloc() and free() can take up to one-third of a program's execution time. C++ programs may use significantly more dynamic memory allocation than C programs. The authors of the paper below moved memory management to a separate thread and added other optimization techniques. Their approach speeds up programs by ~20% on multi-core CPUs. Great simple idea!
Paper: http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf
In Clustertech, we have also implemented a similar idea for our clients. In our parallelization of HKO's SWIRL (which predicts HK's heavy rain), we found that write-to-disk operations blocked the (already MPI-parallelized) computation for a significant portion of the time. We moved the write-to-disk operations to a dedicated MPI process and let the other MPI processes continue computing without waiting for the writes to complete. It saves a lot of time (ask Nin how much, if he can remember), and in this way HK citizens get the Red/Black Rain Signal earlier and can prepare to evacuate (from office to home :))
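The pattern can be sketched in a few lines. This is not the SWIRL code: it swaps MPI processes for a Python thread and a queue, and the computation is a stand-in, but it shows the same hand-off where the compute loop queues its output and carries on while a dedicated worker does the writing.

```python
import os
import queue
import tempfile
import threading


def writer(q, path):
    """Dedicated worker: drain the queue and write each item to disk."""
    with open(path, "w") as out:
        while True:
            item = q.get()
            if item is None:  # sentinel: no more output is coming
                break
            out.write(item + "\n")


def run(steps, path):
    """Compute loop that hands results to the writer instead of writing itself."""
    q = queue.Queue()
    worker = threading.Thread(target=writer, args=(q, path))
    worker.start()
    total = 0
    for step in range(steps):
        total += step * step                       # stand-in for real computation
        q.put("step %d total %d" % (step, total))  # hand off, do not wait for disk
    q.put(None)
    worker.join()  # wait only once, at the very end
    return total


path = os.path.join(tempfile.mkdtemp(), "out.txt")
result = run(5, path)
print(result)
```

The compute loop never blocks on the disk; it only waits once at the end for the worker to finish flushing.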
Agile vs waterfall software development
I find this IEEE Software article interesting. It analyzes agile vs waterfall software development methodology.
More papers:
A short tutorial on using JSON (JavaScript Object Notation) in Web application
Audience: Colleagues who are interested in programming web applications. Beginner's level.
This article introduces the JSON data format and how to parse it in Java. Though the example in the article is for Android (mobile phone), it introduces the org.json.* Java packages, which are also applicable to web application programming on the server side.
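For readers who just want to see what the format looks like, here is a tiny sketch. It uses Python's standard json module rather than the org.json Java packages covered in the article, and the record itself is invented for illustration.

```python
import json

# A made-up JSON document: an object with a string, an array and a boolean.
text = '{"user": "jerry", "roles": ["dev", "admin"], "active": true}'

data = json.loads(text)  # JSON object -> dict, array -> list, true -> True
print(data["user"], data["roles"][1], data["active"])

# Going the other way: serialize a Python structure back to JSON text.
print(json.dumps(data, sort_keys=True))
```

The same three building blocks (objects, arrays and scalar values) are what any JSON library, in Java or elsewhere, maps onto its own types.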
Algorithm breakthrough for Uncertainty Quantification
This article is more suited for the taste of BIC team members or colleagues who are interested in algorithm.
In uncertainty quantification, computing the diagonal of inverse covariance matrices is of paramount importance. Conventional techniques have a computational complexity of O(n^3). IBM invented a method which is O(n^2) for this problem, together with other clever techniques that make the algorithm parallelized and fault-tolerant, and that let it use single-precision computation (suitable for GPUs) with an iterative technique for improving accuracy. The resulting implementation reaches 73% CPU efficiency, a rather high utilization for a real-world application.
Potential applications of such algorithm: weather forecasting, supply chain management, business intelligence, financial portfolio analysis and others.
News: http://www.hpcwire.com/features/IBM-Invents-Short-Cut-to-Assessing-Data-Quality-85427987.html
Paper and Abstract: http://delivery.acm.org/10.1145/1650000/1645421/a8-bekas.pdf?key1=1645421&key2=8721717621&coll=GUIDE&dl=GUIDE&CFID=77531079&CFTOKEN=42017699
Intel produced 48-core CPU prototype
Philip shares with us this article http://spectrum.ieee.org/semiconductors/processors/intel-lifts-the-hood-on-its-singlechip-cloud-computer. To effectively utilize 48 cores on a chip, Intel Labs has innovated to add "routers" for each pair of cores to transfer data to other cores within the chip, eliminating the access to main memory. To save silicon and power, this chip design relies heavily on software, like controlling data flow in caches and changing voltage and frequency.
As with previous revolutions in CPU architecture, software support is very important. Let's see whether compilers and software developers can catch up and effectively utilize this new hardware.
Anatomy of the libvirt virtualization library
Isaac shared this article (http://www.ibm.com/developerworks/linux/library/l-libvirt/index.html) with Tech Team. I think that it is a good article to share with general colleagues, esp for those who attended the KVM STP talk (Internal link). When using KVM, we often use virsh and libvirt's other utilities like virt-install and virt-clone. Actually, libvirt also provides an API to manage virtual machines by programming. Quoting Isaac's intro:
This article talks about the capability of the library, gets into its architecture, shows command line usage, shows how to use the Python binding, and touch a bit on its API. Probably interesting for many of us, see if you are included.
